System, apparatus, method, and computer program product for indexing a file

ABSTRACT

A search engine manages the indexing of web page contents and accepts user selection criteria to find and report hits that meet the search criteria. The inventive search engine has an associated crawler function wherein display images of the web pages are rendered and stored as snapshots, preferably when the pages are indexed. The search engine reports search results by composing an html page with links to the corresponding page hits and containing snapshot reduced size graphic images showing the web pages as they appeared when fetched and stored as snapshots.

More than one reissue application has been filed for the reissue of U.S. Pat. No. 6,643,641. The reissue applications are U.S. patent application Ser. No. 11/266,750, filed Nov. 4, 2005 (the parent reissue application) and U.S. patent application Ser. No. 11/513,423 filed Aug. 31, 2006 (the present continuation reissue application of the parent reissue application), all of which are reissues of U.S. Pat. No. 6,643,641.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention concerns methods and apparatus for representing data file contents for searching the data files and reporting selected data file addresses, especially hypertext markup language files accessed using an Internet search engine (i.e., Web pages). One process develops a database representing the text content of data files on a network. Another process renders graphic representations of the files according to a default configuration and stores a compressed graphic file for each. A further process selects file hits according to user criteria and reports their addresses with associated presentation of the stored graphic file.

2. Prior Art

A search engine is a useful facility for browsing the Internet or World Wide Web. Popular browsers such as Microsoft Internet Explorer and Netscape Navigator display visual outputs using hypertext markup language or “html.” An enormous variety of information is stored in html format in subscriber homepages and the like on the Web, and much of the information is accessible on the Web by simply pointing one's browser to the associated page or file. Html files typically contain, for example, text and numeric information, typographical symbols, information defining formatting particulars by which the text is to appear on a display of the file, and uniform resource locator references (URLs), which are hypertext links that address other files. Some of the URLs address or point to other hypertext pages that are linked to a displayed page. The user can highlight and select a URL by pointing and clicking using his/her mouse, whereupon the browser loads and displays the identified page. Alternatively, the link may be such that this point-and-click method causes the browser to jump to a display of a different position in the file, or to perform an identified action such as downloading and playing an audio or video file, or may cause the browser to alter its display of the present data, such as inserting or enlarging a display of a graphic file. The link may also cause the browser to invoke an applications program or a process, etc.

The html files which are addressed typically contain certain formatting information. All users who download the html file obtain the identical file and formatting. However, the display and processing of the files is not necessarily the same from one user's browser to another. The html page does not contain a fixed graphic data display. The html page contains text, addresses and encoding information which are processed by the browser and the system operating the browser, to prepare and present a graphic data display.

Browsers from different software suppliers are not identical and operate somewhat differently. The same browser program can be set up by user options for display of data in selected ways, including for example choices of font size and font type. There are also alternative choices for applications programs that may be run within the browser (often called plug-ins) or which are invoked when a file of a particular type is selected.

Using font size as an example, the operating system (e.g., Microsoft Windows) and the display may be configured to employ a certain X-Y pixel size and color display resolution. In the browser, the user may have selected one of several available font sizes, which in combination with the X-Y pixel size of the display field determines the vertical and horizontal size of each character. These choices affect pagination and the layout of text within text subdivisions such as paragraphs or tables. The browser may allow the user to select a default character alphabet. The browser may also allow the user to select how and whether background and foreground colors are displayed, or whether colors are even used in certain situations, such as to distinguish links from other text or to highlight a link when selected by the cursor or mouse.

The typical html source file contains text and may include or contain addresses identifying static or dynamic files and information, but the source files are usually not limited to text. The source files contain header, footer, paragraph and section markers, font and color changes which may distinguish sections, markers indicating text strings to be interpreted as html links (URL addresses that are delineated as such), and other formatting and instructions. These and other markers, which include hidden text tags and textual start/stop markers, are not themselves displayed but instead are used to carry undisplayed information or as specifications for display of the remaining text according to preset rules and configuration choices in the browser and the operating system.

Users often refer to the display of a particular web page as “going to” the web page. In fact, “going to” the web page is a misnomer. The process actually involves sending a message to a remote server or user station on the web that requests transmission of the html source code stored there. Upon receipt the source code is processed locally by the browser so as to produce data representing a graphic display. The graphic display data is stored in a memory buffer in the system RAM or in an associated display driver card from which the luminance, saturation and hue of each pixel in the display are determined. After “going to” a web page, the browser may store a copy of the source code locally so that using the “Back” function reloads the page without the need to wait for another exchange of messages over the Web.

Users may know the URL for a web site they wish to load, but also may need to find files with selected content without knowing the corresponding URL. For this purpose the user can “search the Web” using a search engine. Early search engines did live web page searches and came to be known as “web crawlers.” The number of searchable pages has multiplied, however, and it would be an immensely large job to attempt to address, load and search all the possible URLs that might identify a web page today. This web crawling method is now impractical for on-demand searching.

Search engines now operating do not search web pages on demand. Instead the search engine operators use various means to build a limited database reflecting the contents of a number of web pages. The users' search criteria are applied to the database to identify the addresses of web pages that meet the search criteria, at least from a subset of all existing web pages. Web page content can be changed. The search is current up to the most recent time at which the search engine database was updated to reflect the latest content of the web pages subject to search.

The web pages to be reflected in the database are indexed to build a record of the terms that appear in each web page. Search engines vary but typically the index database reflects at least the presence of single words to enable selection by Boolean combinations. At least some proximity relationships and/or the presence of exact phrases can be made searchable. The indexing can include a selection of field information, such as revision dates, country of domain and other fields, which in some cases are automatically generated and in others require human review (e.g., to define a business category).

The search engine operator can use various methods to find or select web page addresses that will be loaded and analyzed or indexed in building the database. The methods may be chosen to expand or to limit the number of web pages that the search engine will access. As a result, the results of searches vary among the different search engines.

For example a web crawler or similar routine might attempt to load and analyze pages corresponding to all the top level domain names that are found to be registered with public domain name services or listed in a directory service [e.g., http://www.[domain].com]. Search engine services also can queue for indexing all pages that they are specifically requested to index (which request might be submitted by the page owner or another).

When indexing an initial collection of web pages, the list can be expanded by parsing the received pages for hypertext links and URL addresses that identify additional pages, and then loading and analyzing all the pages that are connected to the initial pages in that way. This process can be extended indefinitely. A smaller set of pages might be obtained by only indexing the top level pages or only links to top level pages out to a certain number of links from the originally targeted page.

Examples of search engines include Hotbot, Alta Vista, Yahoo, Northern Light, Excite, etc. In addition, there are some search engine portals that run the same user query through a plurality of other search engines. The search engine comprises a processor that maintains a web page which the user loads by aiming his browser at the search engine URL (e.g., Excite's URL is http://www.excite.com/). The received page (namely the processed version of the html source code that is displayed) typically includes one or more Common Gateway Interface (CGI) boxes or similar form processing means by which a user who wishes to make a search enters one or more letter strings as search criteria. Boolean combinations of two or more strings often can be included or will be implied if not stated. The criteria typically are construed as met if the specified words or phrases are found anywhere in the html source code of the target pages when last indexed. This includes portions that are not displayed (e.g., meta-tags and comments). The criteria can specify attributes other than the presence anywhere of a certain text string. This may be helpful, for example, to limit search results to finding files of a certain type (e.g., with URLs linking to a certain file extension type to find a certain kind of media). The criteria can also bracket out files in a selected date window.

The search engine compares the criteria to available information for web pages and sends to the user a report identifying the web pages that meet the criteria. The report to the user is transmitted in html source code. To generate the report, the search engine finds URLs for the selected web pages and inserts a list of these URLs into a shell form (i.e., an “empty” html source code file). The shell form has text and formatting to display title headers, possibly also ad banners and similar information. The URL list that is produced is inserted into the html shell. Each URL is flagged in the html source as identifying an html link (href=[etc.]). Thus when the list is displayed by the user's browser, the user can select among the results and point and click or similarly highlight and invoke the html link addressing the page that the search engine considered to meet the user's criteria. This then loads the html source code directly from the remote page that was selected and the browser displays the current contents of the referenced web page according to the html source code found there at that time.
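By way of non-limiting illustration only, the following sketch suggests how such a report might be composed by inserting the hypertext links for the selected pages into a shell page. The sketch is written in Python merely as a convenient notation (no particular language is required), and the shell markup, function name and hit data are hypothetical examples rather than part of the disclosure.

    # Illustrative sketch: composing an html search report by inserting
    # hypertext links for the selected pages into a shell page.
    def compose_report(hits):
        # hits: list of (url, title) pairs selected by the search engine
        rows = "\n".join(
            '<li><a href="{0}">{1}</a></li>'.format(url, title)
            for url, title in hits
        )
        shell = ("<html><head><title>Search Results</title></head><body>\n"
                 "<h1>Search Results</h1>\n<ul>\n{rows}\n</ul>\n</body></html>")
        return shell.format(rows=rows)

    # Example use (hypothetical data):
    # print(compose_report([("http://www.example.com/", "Example Home Page")]))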

After running a search and loading the web page referenced in a URL that is mentioned by the search engine as meeting the search criteria, it is not unusual that the user may not find the loaded web page to contain the terms used as the search criteria. This occurs because the content of the page was changed to eliminate the search term between the time that it was indexed by the search engine and loaded by the user who ran the search. For the same reasons, linked pages that are reported by a search engine sometimes no longer exist.

It would be possible to employ a web crawler process not only to find and index web pages but also to update the pages already indexed. The job of indexing web pages is growing constantly, and the job of also revising indexing work that already has been completed is that much larger of a job. The operator of the search engine must make some decisions on allocating available resources of memory, processing power and communication bandwidth to the jobs of seeking out web pages, indexing and storing usefully complete database information on the pages, and updating their database, as well as to handle user search requests and reports.

The typical search engine reports more to the searcher than the URLs of the indexed pages that meet the searcher's selection criteria. The URLs themselves, which are formatted as hypertext links in the search report, sometimes provide information as to whether or not a search hit is pertinent to the user's desires. For example the domain name associated with the page may identify an owner known to be in a pertinent business, or on the contrary may show that the search result is plainly not relevant to the search. The search engine typically also stores and includes in the search report listing one or two of the first lines of the web page that is referenced, which frequently includes a title that may be helpful to show quickly whether the selected page is of interest. The search listing also may show the date at which the web page was last updated or the date that it was indexed.

The usual success rate in finding a pertinent page or website in one try or only a few tries is actually rather low. The success rate varies with the subject matter, but in a typical search the user's search criteria may turn out to be unduly broad and may select so many pages that they cannot all be reviewed, or may be so narrow that much desired content is excluded, either of which can be an unsatisfactory and perhaps frustrating experience. Balancing the needs to include relevant material and to exclude irrelevant material can result in a substantial expenditure of time, much of which is effectively wasted.

It would be advantageous if the presentation of search results could be supplemented to more effectively assist a user running a search to quickly and meaningfully separate the pertinent and irrelevant results. However, such a capability will only be useful if it can be accomplished without unduly adding processing time and storage requirements to the steps involved in preparing database information for search and in presenting the results to the user.

SUMMARY OF THE INVENTION

It is an object of the invention to provide an abbreviated representation of searchable data files, in particular Internet/Intranet/Extranet html data pages, which represents their text and linked graphics in a visual snapshot form to supplement representations such as introductory text passages and URL addresses. It is a further object to collect and process the necessary information before conducting searches and to store a relatively small graphic file in association with the search database for representing each potential hit. The respective graphics file is reported to the user when a search results in a hit on the file, namely by inserting a hyperlink to the stored file in the search report sent to the user as the search results.

It is another object of the invention to overcome problems associated with the fact that different user configurations result in differences in the manner of displaying files, by preparing a graphic snapshot presentation as described, according to a default set of configuration parameters. Such parameters can specify font type and sizes, colors, backgrounds, screen pixel resolution and the like.

It is a further object to generate and store such an abbreviated visual presentation or snapshot as part of the process of building one or more databases using a web crawler or automated information review process to find and load or otherwise accept and process html pages. Preferably previously processed pages are again accessed and the database is periodically updated. Optionally, the abbreviated snapshot representation can be provided in combination with or in lieu of a tabular listing of the associated hypertext link and perhaps also an introductory portion of the text of the html pages. A hypertext link can be associated with the graphic snapshot such that the user (searcher) can point and click on the graphic to load and view the associated web page.

It is another object to permit such snapshot representation to be initially processed, or reloaded, processed and updated at times or at a frequency that is different from that at which the web crawler database is updated with respect to the text content of the web pages.

These and other objects are accomplished by the improved search engine of the invention, for managing user search and selection of data files stored at distributed systems coupled at network addresses. In particular the search engine is effective to improve searching of hypertext web pages on the Internet. The search engine has an associated web crawler operable to address and load successive web pages, and to index text data associated with the successive web pages. In this manner the search engine obtains parameter information such as words appearing in documents, word proximity and other information that can be used to distinguish at least groups of the web pages from one another when conducting a search. The web crawler stores the parameter information in a manner that cross references the parameter information with the associated web addresses or URLs of the web pages. The search engine accepts user-submitted search criteria and conducts a search of the parameter information to select the associated addresses of web pages that met all or part of the search criteria. The results can potentially be ranked, subdivided into categories and similarly handled according to known search engine operation. According to an inventive aspect, in conjunction with obtaining the parameter information for at least a subset of the web pages subject to search, the crawler renders a display image of the web page that is being indexed, and processes the image to provide a reduced size graphic image file corresponding to a static visual presentation of each of the indexed web pages. This graphic image file preferably is stored in a compressed graphic file format such as GIF, JPG, or a similar file, the file address or URL of which is stored and cross referenced to the criteria in the database that identifies the corresponding web page. When a search is conducted and results in a hit on a web page, its graphic snapshot is linked to the search results reported to the user. In a preferred embodiment, acceptance of the user search criteria and reporting of the results are handled by html page exchange communications between the search engine and the user. The search engine is accessed by the user and provides a form page having CGI boxes or the like for accepting text and/or other selections from the user. The search engine conducts a search which identifies one or more hits that are reported to the user by sending an html search results page. The search results page is composed by the search engine as a function of the search results and may contain no hits or a number of hits. Each of the hits is identified in the search results by the graphic snapshot, and preferably also by text information that reflects the content of the web page hit. Preferably, the search results page is composed to include a hypertext link to the URL address where the graphic snapshot file has been stored by the web-crawler/database/search-engine processes, for example by an IMG SRC=[path\filename] command inserted in html source code. As a result, the image file is loaded by the user's browser when processing the search results page, which generally occurs after the display of text has been accomplished.

As a result, the search results appearing on the user's browser include links to the web pages that were found to meet the criteria (hits), and also a snapshot graphic image of the way that the web page appeared when rendered at the time of indexing.

The invention is applicable to a wide range of search systems. For example, in addition to use with a web crawler and a text indexed word association database (or instead of automated text indexing), the invention is applicable to produce and associate representative graphic snapshots with websites that reside in a human reviewed directory such as Yahoo, wherein subjective characteristics of the data (a text form of which is sometimes termed “descriptors”) are stored in the database for comparison with user criteria in finding hits. In that situation characteristics such as an arbitrary business or art classification may categorize the web pages for selection in a manner similar to text string aspects used such as the presence of selected strings, word associations, proximity and the like. The invention is also applicable to automated categorizing processes such as used by Northern Light.

According to an inventive aspect, the graphic image file that is produced is not necessarily identical to the appearance of the page when ultimately loaded by the user after a search. In addition to the fact that the web page may have changed since it was rendered into the graphic file, the rendering is accomplished according to a predetermined display configuration of the crawler when rendered. Nevertheless, the graphic is a useful and very quick means for a user to sift through search results and determine immediately whether or not at least some of the hits bear further investigation.

BRIEF DESCRIPTION OF THE DRAWINGS

There are shown in the drawings certain non-limiting examples illustrating embodiments of the invention as presently preferred. The same reference numbers are used throughout the drawings to identify corresponding elements in the respective figures.

FIG. 1 is a schematic block diagram illustrating a first embodiment of the invention.

FIG. 2 is a block diagram illustrating the elements associated with collecting, processing and organizing a database of information according to the invention, to be used to conduct searches.

FIG. 3 is a block diagram illustrating operation of the invention in connection with executing and reporting the results of searches.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

According to the invention as generally shown in FIGS. 1-3, the reporting of search results by a search engine 20 is improved and facilitated by offering each searcher or user 30 a visual representation 35 of the web pages found to meet the user's search criteria submitted to the search engine. The invention is particularly applicable to an Internet search engine but can also be applied to other networks 50 where the search engine 20 is available for managing user search and selection of web pages or similar files, stored at distributed systems 52 coupled to the network. The web pages, which may be considered data files, are found at addresses to which the search engine can link to load the data files, for example being accessible using URL addressing of the pages as hypertext markup language (html), file transfer protocol (ftp), telnet or other such file types. The data files may have embedded links to other data files or to graphics or other media files. The search engine 20 of the invention accepts user queries that characterize files of interest, searches for the files and reports to each such user the results of the search including network addresses of the files found to at least partly meet the query, enabling the user to link directly to the files, and also a snapshot of how the file will appear according to the most recent rendering performed by the crawler of the search engine.

The invention is described in this disclosure with primary reference to the preferred application to an Internet coupled search engine in which the data files searched are html pages on the Internet or worldwide web 50. Insofar as such files are accessible for loading and review by other users via browsers and search engines, they generally contain hypertext markup language (html) text, comments or tags, formatting commands, and links addressing other files. The data may contain text, media, scripts, programs, etc., and may be addressable at the same network address or a different address. The files may contain information that is not displayed when rendering the file, but nevertheless can be used to categorize the content of the files.

In the preferred example, the basic files (e.g., web pages), as well as the other files and systems to which they refer, are addressable using standard uniform resource locator (URL) addresses, containing a high, mid and low level domain name that is resolvable by a domain name server into a numeric Transmission Control Protocol/Internet Protocol (TCP/IP) address by which packets of data are directed from one computer system on the Internet to another. In this case such packets as transmitted to the system 52 containing the web page to be subject to search request transmission of an addressed web page (see FIG. 1). That system 52 responds by transmitting the contents addressed. The packets are reassembled for or by the receiving system. The browser or a similar process of the receiving system processes the data, normally but not necessarily for visual display on a local monitor.

Although described with respect to browser searching on the Internet, the invention is likewise applicable to other environments such as searching within a company intranet or other group of accessible data stores which have a visual aspect. The invention is also applicable to platforms and user interfaces other than PCs and browsers, such as the various Unix processes which are run on PCs or mainframes, etc. Furthermore, the invention is applicable to various wireless communication architectures. These environments and platforms are not limited to consumer and business use, and have applications in technical, military and other situations as well.

A block diagram showing an improved Internet search engine 20 according to the invention, for managing user search and selection of web pages stored at distributed systems 52 coupled at network addresses to the Internet 50 or the like, is shown generally in FIG. 1. FIG. 2 illustrates a succession of method steps and/or programmed operations of the system for building and adding to or updating a database 62 of searchable information. FIG. 3 illustrates a method and apparatus for conducting searches by accepting user queries 54, conducting searches of the database 62 and reporting search results in the form of a composed search report 80 containing visual representations or snapshots 35 that depict a presentation of how the selected pages would have appeared according to a default display configuration at the time they were accessed by the crawler 60.

It should be appreciated that the invention is discussed in connection with processes organized in functional blocks in the drawings. This illustration is helpful to illustrate the input and output sources and destinations, the operational steps undertaken, the various memory stores and data types involved and other aspects. However the illustration is not intended to exclude arrangements, for example, wherein separately illustrated units are sequential operations of the same processing element or wherein illustrated functions or storage capacity are distributed over separated units, especially separate processors coupled to a common network. The separately illustrated or commonly illustrated elements can be combined or separated as convenient, without departing from the invention and while serving the same functions.

The search engine 20 in the embodiment shown in FIG. 1 has an associated web crawler 60 operable to address and load successive web pages from remote servers 52 on network 50, and to index or to otherwise accept or generate descriptors that characterize text data associated with the successive web pages that are loaded. In this way crawler 60 develops parameter information on the successive web pages that can distinguish at least groups of the web pages from one another, and at times can be used selectively to identify a single web page, provided some encoded aspect of that page is unique among the pages loaded and processed. The crawler 60 stores the parameter information and associated addresses of the web pages as a database 62 in a storage medium 64 that is accessible to a search processor 78 that accepts the user criteria 54 and prepares and sends search reports 80 to the query submitting user 30. The search engine portal or processor 78 responds to user submitted search criteria by searching the parameter information in the database 62 and reporting to user 30 at least the associated addresses of data files that met the search criteria when indexed. In particular, search portal/processor 78 reports the URL addresses 82 of web pages meeting the user criteria.

The web pages are generally maintained on web servers 52 (FIG. 1) that are “remote” from the querying user 30 and from the search engine 78, but actually could be anywhere that is addressable on the particular network, including on the user's own system. The web servers 52, in known manner, store text and graphic data or addresses of graphic data found elsewhere. That information is available upon request and in the case of the Internet and other TCP/IP protocol type networks is transmitted in packet form to any user that requests the web page by directing a request to the web server identifying the TCP/IP address of the web server 52, the sender's address or identity, and the address of the desired page. This normally involves addressing using URLs that identify the type of communication desired, such as transmission of an html page (versus a linked graphic or media file, or perhaps a different type of interface such as ftp or telnet), and an address that represents the domain name and a subdirectory path leading to the actual html file or other file.

The same sort of URL addressing is used internally in html pages to address image and other files that may be located at the same web server or elsewhere on the worldwide web, namely by providing a hyperlink that states the network address of the text or other content, as opposed to containing the content itself. Such hyperlinks can also be invoked to move around in a given file, for example from one subheading to another. The hyperlinks are embodied by automatically recognizable codes (e.g., “href=” or “img src=”) that appear in the source code together with the various start and stop tags that specify text formatting, colors and other aspects of the page as it should be displayed, for example using a browser. In a browser such as MS Internet Explorer, the source of a displayed page can be displayed by selecting “View” and “Source” from the toolbar.

According to the invention, a crawler 60 collects web page data and is generally shown in FIG. 2. Crawler 60 can be operated preliminarily but preferably operates continuously during operation of the other components to collect additional data and/or to update data already collected. Crawler 60 has one or more fetching processes 66, several being shown in FIG. 1 and identified as Agent A (fetch) processes. The crawler 60 via its fetching processes 66 determines web pages to load and attempts to load them. For example, the crawler 60 may test TCP/IP addresses (known as scanning) or attempt to load pages from particular domain name addresses where servers might be up and running, obtained, for example, from a domain name server (not shown). The text portion of any data obtained by the fetching processes 66 from a particular URL address is parsed or divided into discrete terms and statements. These terms and statements are compared to predetermined reserved terms and formats that represent URLs, file addresses and the like. When the comparison indicates that a hyperlink to another file or web server has been found (or that a given string so resembles a hyperlink as to be interpreted as such), the found address is added to a list of addresses and an attempt is made in due course to load a file at that address, thus increasing the field of files that have been consulted.
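A minimal sketch of such a fetch-and-parse step is given below, assuming only the Python standard library; the regular expression, function name and example address are illustrative assumptions and are not mandated by the invention, which may locate hyperlinks by any suitable parsing technique.

    # Illustrative sketch: load the html source of one page and collect strings
    # that resemble hyperlinks so they can be queued for later loading.
    import re
    import urllib.request

    HREF_PATTERN = re.compile(r'href\s*=\s*["\']?(http[^"\'\s>]+)', re.IGNORECASE)

    def fetch_and_extract_links(url):
        # Request and load the html source code from the remote server.
        source = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        # Keep the strings that conform to the reserved "href" format.
        return source, HREF_PATTERN.findall(source)

    # Example use (hypothetical address):
    # source, new_addresses = fetch_and_extract_links("http://www.example.com/")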

The general function of the Agent A fetching processes 66 is to obtain the files available from remote web servers 52 and to note the addresses of the files (URLs for the Internet) that when invoked will address and load the file. As a result of communication delays, it is preferred to employ a plurality of concurrently active requests for files so that one file can be processed while waiting to receive another. This aspect is represented in the drawing by plural Agent A processes 66, which obtain the fetched files and store at least part of the fetched files in a buffer memory or queue 92. In connection with html web pages, the data includes html source code, addressed files containing images, audio or other media, which are stored in buffer 92 together with the addresses from which they were obtained.
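The following rough sketch, again in Python and offered only as one possible arrangement, shows several concurrently active fetchers placing loaded pages and their addresses into a shared buffer queue in the manner just described; the worker count and error handling are arbitrary assumptions.

    # Illustrative sketch: plural concurrent "Agent A" style fetchers feeding a
    # shared buffer queue of (address, data) pairs.
    import queue
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    page_buffer = queue.Queue()  # analogous to the buffer memory or queue 92

    def agent_a(url):
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
            page_buffer.put((url, data))  # store the page with its address
        except OSError:
            pass  # a failed or delayed transfer is simply skipped in this sketch

    def crawl(urls, workers=8):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            pool.map(agent_a, urls)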

The collected information from downloaded files, particularly text files, is processed according to a generally conventional text processing or categorizing technique 68 to build a text or descriptor index in database 62 as shown. The database 62 contains an index developed from automatic analysis (generally “indexing”) or human review (categorization) of the text and other data, indexed to the URLs of the pages from which they were obtained. Insofar as the automatic or human generated descriptors and addresses are described herein as a “text” index, it should be appreciated that the index might represent any attributes of the content of the respective web sites, not limited to words in their displayed text. For example terms in hidden meta tags, comments in the source code, strings found in addresses and the like are also potential data points that may be collected. Any arbitrary characterization that may be automatically assigned or assigned by a human reviewer can be deemed an indexed point. For example, the index could contain information as to the type of links found in the source, the date of the last update, the country of origin or language, whether the site appears to be academic or commercial, an entry for rating content as adult or “general admission” for keying child protection interests, and so forth.
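A simplified sketch of such an index, cross-referencing each term to the URLs of the pages in which it was found, is shown below; the tokenizing rule and data structure are assumptions made only for illustration.

    # Illustrative sketch: an inverted "text" index mapping each term to the
    # set of URL addresses of pages that contained it.
    import re
    from collections import defaultdict

    text_index = defaultdict(set)

    def index_page(url, text):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            text_index[term].add(url)

    # Example use (hypothetical data):
    # index_page("http://www.example.com/", "The quick brown fox hardware store")
    # text_index["fox"] -> {"http://www.example.com/"}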

According to an inventive aspect, the crawler 60 that is operable to receive the web pages and to extract the parameter information from them, generates a file 72 of graphic image data corresponding to an appearance of each of the web pages, which is stored, preferably as a reduced-size and compressed image data file 75, in association with the database data respecting the page. When search results are reported to the user (FIG. 3), the search engine reports the associated URL addresses 82 of web pages that met the search criteria in a conventional manner, preferably inserting a hypertext link to each identified page into an html page reported to the user, optionally a short description or excerpt, and also inserts into the report page the graphic image snapshot file by inserting into the source of the report page a link to the stored compressed graphic image file 75. The user's browser displays the search results in conventional form, namely by showing a selectable hyperlink to the addresses and optionally a description or excerpt, and displays a snapshot of how the identified page is likely to appear if or when it is loaded by the user's browser, should the user point and click to the link to invoke the URL of the page hit.
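The reduction and compression of the rendered image might be handled, for example, as in the following sketch. The sketch assumes that some rendering component has already produced a full-size image of the page, and it uses the Pillow imaging library purely as an illustrative choice; neither the library nor the particular dimensions or quality setting are required by the invention.

    # Illustrative sketch: reduce an already-rendered page image to a small,
    # compressed snapshot file whose path can be stored with the page's URL.
    from PIL import Image

    def make_snapshot(rendered_image_path, snapshot_path, max_size=(200, 150)):
        img = Image.open(rendered_image_path)
        img.thumbnail(max_size)  # reduce to snapshot dimensions
        img.convert("RGB").save(snapshot_path, "JPEG", quality=60)  # compress
        return snapshot_path

    # A database record might then associate, for example (hypothetical fields):
    # {"url": page_url, "snapshot": snapshot_path, "indexed_terms": [...]}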

The search portal 78 that performs the search by reference to the database 62 in storage media 64, reports the search by composing a web page containing the search results, assembling the search report using hypertext markup language. The search report contains headers and information identifying the portal and perhaps contains advertising. The search report also lists the hits that resulted from the search. More particularly, the search engine inserts (in list or table form) a text string showing the URL address of each web page hit (i.e., the pages found to meet the user criteria) together with a hypertext linkage to that URL (e.g., an “href=” statement), causing the user's browser to show a link that can be invoked (pointed and clicked) to load the page at the stated address. Such a report is conventional in an html source search report. It typically also has a description or excerpt and may be arranged in a pyramid or hierarchy of categories. According to the foregoing inventive aspect, the search engine also inserts the URL address of the graphic file that has been processed by a further process identified in FIG. 2 as Web Agent B 95, to contain a snapshot reduced/compressed graphic 35 representing the page hit.

The link to the compressed rendered graphic file can be made, for example, by use of an IMG SRC=<domain>/<path><filename> command in the html source. The graphic can be associated with a hypertext link to the hit page URL as well as linking using an HREF=<URL of hit page> command as mentioned above. As a result, the user's browser when displaying the search results also displays the graphic snapshot image, as shown in FIG. 3.
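For illustration only, one hit entry in the composed results page might be generated as in the following Python sketch, which wraps the snapshot image reference and the page title in hypertext links to the hit page; the URLs and markup details are hypothetical placeholders rather than a required format.

    # Illustrative sketch: one search hit rendered as a snapshot image and a
    # text link, both pointing to the hit page URL.
    def hit_entry(hit_url, snapshot_url, title):
        return ('<p><a href="{hit}"><img src="{snap}" alt="{title}"></a><br>'
                '<a href="{hit}">{title}</a></p>').format(
                    hit=hit_url, snap=snapshot_url, title=title)

    # hit_entry("http://www.example.com/",
    #           "http://search.example/snapshots/1234.jpg", "Example Home Page")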

The invention has three main components, shown generally in FIG. 1. As shown in FIG. 2, these include the crawler processes that fetch files from web pages in the universe of web pages to be subject to search, and the processes that index or catalog the pages and render the fetched files into graphic image files. The processes in FIG. 2 can generally be considered the processes that obtain raw data and process it to provide a searchable database and information that may be included in search reports when a web page becomes a hit. Preferably, according to the invention the crawler processes 66 that are associated with collecting the raw data files, which experience communication delays, are separate from the processes 68, 95 that process the raw data into a form apt for storage in database 62 in preparation for searching. FIG. 3 illustrates the processes 78 that interface with a user who seeks to search the web 50, including presenting the web page hit information to the user in html form for browser display.

Referring to FIG. 2, the search engine includes or is associated with web crawler 60, which is an engine that conducts web page addressing, loading and analyzing, and stores representative data in a storage device 64 containing a database 62. The stored representative data characterizes the web pages that the crawler loads and that are analyzed for content by process 68. Of the main activities to be effected by the search engine system (i.e., by the crawler and the search processor), preparation of database 62 allows a search to be conducted more quickly by reference to the processed database information gleaned from the field of possibly-selected files, than would be possible if the search engine attempted to load and analyze the entire universe of files after the user had submitted query 54 (FIG. 3), namely while the user was awaiting search results.

The process of preparing database 62 includes determining URLs (or perhaps TCP/IP addresses or other addressing strings) for the files to be searched, and then loading and analyzing the files to note the occurrence and juxtaposition of text strings. Alternatively or in addition, the files are categorized for other aspects, for example by human review and assignment of arbitrary descriptor categories that tend to distinguish files by their content or owner or type, etc. The files or webpages consist essentially of ASCII characters stored in a text file that is known to be or is identified as hypertext markup language (often having an “htm” or “html” extension on the file name). As a result the ASCII character strings in the web page are searched for combinations of characters that conform to specific code name and character rules whereby they can be interpreted as commands or links or other specific forms of information in html.

Html is a form of standardized markup language in which various tags are associated with ASCII character strings. Many of the character strings and tags used in html webpages concern the appearance of the associated text and the visual aspects that are to be displayed concurrently with the text. Such commands can specify a header, a background pattern, color or complete image, set or reset a font type, font size, capitalization or color, change justification, centering and margins, specify lines, a table or frames, call for insertion of a graphic figure in any of several formats, which may be static or animated, and otherwise generally vary the appearance of the page and the text on the page. The strings also can address additional files.

The encoding of a representation of the occurrence and juxtaposition of text strings is generally known as indexing, and results in a database of information in which each text string found during the analysis of all the files or pages searched is referenced to the URL address where the files or pages can be found. According to the present invention, such indexing can be construed to include other methods for categorizing data files in a manner that allows distinctions to be drawn that are useful for searching, including human reviewer categorization and discrimination for non-text factors such as the revision date, country of origin or the like.

The database 62 is generated by preparing or obtaining a set of characterizing parameters concerning the fetched files, or their addresses or content or the like. Database 62 contains a cross reference between criteria and the identity (normally the URL address) of the file that matches the criteria. Assuming that the criteria concerns a concatenation of terms (e.g., “quick brown fox”), all the URLs of files that contain that string are available by searching for the string. Likewise the URLs of all the files containing the component terms are available (“quick” or “brown” or “fox”), and these terms or phrases can be combined with other terms or arbitrary categorizations to find a page (such as the Quick Brown Fox Hardware Store). The indexing and/or categorization particulars can be objective or arbitrary, and wholly or partly driven by human review or by automated means, and can concern any aspect that tends to be unique to individual files or common to subsets of files only.
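By way of a short illustration, and assuming an inverted index of the general kind sketched earlier in this description, the stored cross references could be consulted as follows to find the pages containing all of several terms; the function and data shown are examples only.

    # Illustrative sketch: intersect the stored cross references to find pages
    # containing every term in a query such as "quick" AND "brown" AND "fox".
    def pages_matching_all(index, terms):
        term_sets = [index.get(term, set()) for term in terms]
        return set.intersection(*term_sets) if term_sets else set()

    # pages_matching_all(text_index, ["quick", "brown", "fox"])
    # -> the URLs of files whose indexed text contained all three terms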

Automated indexing and similar characterization systems may seem objective but the results are determined in part by usage chosen by the author of the content, which is to some extent arbitrary. Human review is subject to potentially arbitrary choices by the reviewer. The search database as discussed herein includes any collection of information prepared in a manner that enables search criteria to be compared to stored criteria to distinguish files from one another. The search criteria involves combinations of categorizations and/or text strings and other factors, chosen by the user in an effort to target the files or pages that have a desired subject or include reference to a particular datum. At the same time, each criterion is not applicable to every page reviewed, and as a result it is possible both to collect files that meet a user's criteria and to eliminate files that do not meet the criteria and thus are irrelevant to the particular search.

Referring again to FIG. 2, the universe of files and pages can comprise, for example, all the high level pages of registered domain names on the Internet, plus a series of additional lower level pages. The lower level pages can include all the pages to which the high level pages are linked by hyperlinks in the content of the high level pages and/or frequently encountered subpage names such as “index” and “home”. Various such processes are conventionally practiced using so-called web crawlers that are operated constantly, often during low traffic hours, to find, load and analyze (index) a very large universe of web pages.

Conventional web crawlers prepare a database that records and can be used by searchers to select (or de-select) web pages primarily on text strings and Boolean combinations of text strings found in the content of the web pages and indexed in the search engine database. The web crawler/search engine database also can be arranged to record and permit searchers to select and de-select on the type of media linked to a page, on a window of dates, the language of the web site or page, the location of the registered domain, the depth of a particular web page in the directory structure of the target site, and other aspects.

Although it is possible and useful to encode and to select web pages based on attributes that are determined from letter strings found in their text or perhaps in the particulars of their URL address, it is not readily possible for an automated web crawler and associated processor to encode much of the appearance of a web page. In the event that the web page contains a link to a graphic image file, for example, the URL address of the graphic image file, including its file name, will be found in that web page, but the graphic image could have any content and may or may not be consistent with the file name. Therefore, known search engines cannot discriminate among web sites by virtue of most of the attributes that affect the graphic appearance of a site's contents when displayed on a browser or the like. However users can readily discriminate among web sites, particularly some forms of web sites, by appearance only.

The configuration of the user's system also affects the appearance of web site content when displayed. On the level of the browser, the user can opt to display particular font types, and also can specify font sizes. These configuration choices affect the appearance of a retrieved page even if the page defines specific fonts that are available to the browser. The browser may also permit the user to select whether or not to use the background colors of retrieved sites and other features affecting the display. On the level of the operating system, the user can opt for different display options such as the number of pixels and the color resolution employed. These aspects also affect the display. As a result of such user choices, retrieved web pages appear differently on different users' displays when retrieved. For the most part, differences due to such configuration choices do not grossly affect the appearance of the web site, but they do cause an identically encoded page to appear differently on differently configured systems and/or browsers.

The search/reporting steps of the browser, generally shown in FIG. 3, include accepting search criteria 54 from user 30, for example using a CGI script technique in which the user enters selections including text strings, literal strings of plural terms, additional encoded aspects such as media types, date windows or limits, countries of origin, etc. The user may also select Boolean relationships (AND, OR, NOT, XOR). The search portal may require commands or may permit selection using point-and-click steps. The search engine compares the search criteria to the pre-prepared database of information gleaned from the web pages it has loaded and analyzed from the field. The results are reported to the user by preparing and formatting an html source reporting page into which hyperlinks are entered that name and point to the addresses of the files that were found to meet the criteria. Often the report includes other information such as the date the page was last updated before it was indexed, and a few lines of introductory text from the page, which provide a hint to assist the user in determining without loading the page whether the page is likely to be relevant to the search. If the user finds a link that appears to be pertinent, the user selects and engages the hyperlink. This causes the browser to load the html source found at the URL address shown in the search report, and any referenced files and links therein. However, the page may have changed since the time that the indexing was accomplished and may have totally different content than it had when indexed. The page may no longer exist. In those cases, the search fails except to advise the user that the page formerly held information that might have been of interest.

Deliberate as well as inadvertent “search engine corruption” sometimes occurs. It may be crucial for marketing or other purposes for a web site to be found in user searches on search engines, and it can be lucrative or otherwise beneficial for a web site operator if his/her site is ranked high in the search results for particular terms. Thus, a great number of website operators have ways to misrepresent the content of their pages. Keywords intended to cause the page to be selected and to rate highly in particular categories can be included and may or may not be displayed. Misleading text can be placed in miniscule font at the bottom of a page or misleading text can be hidden by making it the same color as the background on which it appears. Text can also be placed in “ALT” descriptions of images and graphics, thereby indexed by the crawler but not seen by the user. A particular term can be included one or many times to improve rankings, by one of the foregoing techniques, or by overloading keywords in “META” tags included in web pages and not displayed. Another technique is to temporarily post a page to be textually indexed by the crawler/search engine and then to replace its content after it has been indexed, or similarly, meta-refreshing the web page so as to redirect the user to another page address. According to an aspect of the present invention, the user can visually distinguish pages having undesired content and not waste time on them. Search engine corruption using the aforementioned techniques to provide misleading text is averted due to the visual nature of the present invention.

According to an inventive aspect, a system of the type that indexes or categorizes information on web pages for searching is improved by encoding and providing in the search report 80 a standardized graphic representation 35 of the appearance and rendering of each page at the time that the page is indexed. The graphic representation 35 preferably is in the form of a compressed image of the page, described herein as a snapshot, stored in a standard compressed file graphics format at a location accessible to the search portal process 78. The snapshot is acquired when the page is initially loaded by the crawler 60 for indexing (FIG. 2). The snapshot is rendered, converted to the compressed format and stored. When the subject page is selected in a search (FIG. 3), transmitted to the user are the individual snapshots, which have been stored locally to the search portal processor 78, in association with the index/categorization database. In this way the snapshot 35 of the hit page (which may be one of a number of hits that are reported to user 30) is shown when providing the search report.

The snapshots 35 can be contained in formatted image files (e.g., GIF, JPG, etc.). The snapshot image files, or URL addresses pointing to the image files, preferably are stored in the database 62 that also contains the URL addresses of the indexed pages. In reporting search results, the search engine 78 inserts a link 82 aiming to the snapshot image file 35 into the html search results page 80. The search results appear on the user's browser 84 as a link to selected pages with an associated snapshot of the page when indexed, as shown in FIG. 3.

These operations impose challenges that are addressed according to the invention. One problem with acquisition of the snapshots is due to the very large number of websites that must be physically rendered, namely every website that is indexed and is available in the universe of websites subject to search. The website content, including any referenced image files, must be downloaded by the crawler Agent A process(es) 66 and rendered by the rendering Agent B process(es) at acceptable speeds, and preferably also reduced to obtain reasonably sized image files 35. The image files must be accessibly stored and downloaded from the search engine 78 to the searcher (specifically the user's browser 84) at acceptable speeds as well. The invention applies particular technology to solve these and other problems.

Major search engine portals each have a usually-proprietary “robot” or automated process that crawls the web as described above. In each search portal or system a robot or crawler accepts or finds website URL addresses, accesses websites by TCP/IP addressing and loads their source code. The crawler robot automatically parses the text of a website, namely dividing the strings found in the source code into units separated by delimiters such as spaces or punctuation. The strings and the succession of strings are compared to stored parameters whereby certain strings are construed as links or formatting commands, which is noted accordingly. The occurrence and proximity of these strings and the free content strings that are to appear as text in the web page when displayed on a browser, are all noted and stored in a database where this information is cross referenced to the URL address of the website from which the page was loaded.

In operating a conventional crawler and indexing routine as discussed, the website text can be analyzed and indexed at an extremely high rate of speed because the page is treated only as a succession of text strings. No processing time is spent to load and process or otherwise handle any embedded or referenced graphics, media, scripting, Java, or animations. Such files are not helpful for traditional indexing and thus are not requested. The html tags that might be used to find and load files for non-text content may be textually parsed, but their associated data files are never requested and not retrieved by the traditional text crawlers employed by the major search engines. In addition to avoiding processing overhead, no time is devoted by the crawler for data transfer that might be needed to request and receive packets containing the graphic or other media files. The load on the crawler is minimized because the portion of the website that is loaded and processed, namely the text portion, represents little “weight” in communications bandwidth requirements, processing time and the like for most web pages. Without the need to download and process large graphic and media files, simple text indexing in the traditional sense by conventional crawlers is very efficient, simple, and fast.

Although simple text indexing is quick and simple, the exact opposite is the case for full graphic rendering of a web page. Before the display of a web page can be completed, it is necessary for the browser to wait so that each and every required file is downloaded. The browser must wait for all necessary files to be received before a full rendering of the display. Additionally, any script or otherwise dynamic content normally awaits receipt of the entire file before processing begins. Furthermore, image, graphic and media files are very data intensive and thus require substantially increased transmission times in comparison to text.

A web page will contain one single text file, but in contrast may contain dozens of graphic and media files. Traditional text crawling by the major search engines requires that only the one single text file be transmitted and parsed. By contrast, full graphic rendering employed by the current invention requires that each and every graphic, image, and media file be transmitted and subsequently rendered into a full visual depiction of a web page.

In a conventional web crawler installation, dozens of robots can run on the same processor simultaneously, all executing their individual tasks without regard to the other robots present. Employing a large number of robots on the same computer processor facilitates conventional text indexing. Also, the conventional crawler is only concerned with processing text data. The crawler processes need not include many steps required of a browser to handle the graphical content. Specifically, conventional crawler processes do not include generating and presenting a visual display, which would require additional network communication (to obtain graphics, etc.), consume time and processing power, and require devotion of system resources such as the visual display itself (e.g., a monitor).

The text data portion of a web page is most commonly five to ten Kbytes in length and is received in less than a second on a typical network connection. The text file is normally the first file sent from the originating web server. Image files and script or other code, if requested, follow afterwards. The robotic processes of requesting a text file, retrieving packets and reassembling the text file, parsing the text file by finding terms within delimiters, and indexing its contents, can be accomplished under normal circumstances in 0.5 to 1.5 seconds. Assuming a one second average processing time, one computer processor operating, for example, 25 text processing web crawler robots (which may be conservative), can obtain and index the text of 25 web pages per second, every second. Operating continuously, such a crawler could process over 15 million web pages per week. Certain factors limit the rate at which pages can be processed. Web congestion, long files, long transmission sequences, low bandwidth server connections, and other factors that vary from one website to another and one time of day to another may limit processing speed. Nevertheless, a search engine portal that has several computers with multiple robots devoted to crawling the web might complete an entire crawling sequence through a reasonable universe of selected web pages in three or four weeks.

By comparison, complete and total processing of web pages, including rendering all graphics, requires a substantial increase in resources. If a typical website has text content of about 5 Kbytes, that same text file may have any number of associated graphic files, each of which is several times the size of the entire text file. All the web page data must be downloaded totally and processed before accurately rendering the web page, because the data may affect the rendering even if the data itself will not appear on the screen.

A website server is usually prompt in sending short files, such as the requested text of a particular page, and short file transmissions are more frequently successful than longer ones due to the additional packet handling for reassembling the file, and the increased possibility of transmission errors requiring retransmission. The browser receiving and processing the graphic file seems to pause or to stick on presenting a particular graphic section during the resulting delay. The transmission may pause at any point, even on the last packet of a number of successively transmitted files. The receiving browser or other processor cannot complete the total and full rendering of the web page, for display or otherwise, until the delay elapses. The receiving computer simply waits before completing the display of the page.

For rendering a page layout including graphics, the browser or page rendering robot normally requires on the average 30-45 seconds per page to receive and process a web page into a graphically visual layout (an approximation that incorporates a variety of factors including changes in bandwidth, server lag, and lost packets which can result in web pages being delayed).

The graphic layout of the page usually comprises a series of image files. Each file consists of or is unpacked into an array of digital data words representing the saturation, luminance and hue or the respective RGB levels of each pixel in an X-Y field corresponding to the display screen area. On a computer running a browser, the image file is loaded into a series of memory locations accessed by the display driver to drive the monitor display, either in the processor random access memory or in the memory of a video display driver card (or both). The process of rendering a page into a visually graphic layout usually requires devoting a full display memory field to this function, and particular aspects of processors are often devoted to handling a limited number of display images. As a result, only a single image processing application or graphical robot can visually produce the intended web page layout on the screen at any one time. In other words, rendering the page layout of the website at its intended dimensions (displaying a full screen) can only be accomplished using a single graphical application or web browser at a time.

Web pages are intended by their creators to be seen at a size rendered at or near full screen dimensions. Obviously, only one full screen web page can be displayed at any given time on one screen, and as a result, only one graphic robot and its associated hardware can be active to render that display at any single instance in time. This situation is thus unlike the way text is processed by traditional web crawlers, wherein a single computer processor is capable of running dozens of textual web crawlers simultaneously "in the background." This is because requesting, retrieving, and indexing text from a web page does not commit visual or display generating resources. Without the requirement to share this type of resource, any number of the text indexing type of web crawlers can run at one time.

Because of the limitations, constraints and resources used for rendering and display, crawling the entire web for the purpose of successively rendering web pages to produce a display can be impractically slow. If the conventional text retrieving robot is capable of indexing 1 page per second, a graphic rendering robot is capable of processing 1 page display every 45 seconds. As a result, a computer running 25 simultaneous text retrieving robots can index an estimated 15 million (15,000,000) web pages per week, but the same computer running 1 graphic rendering robot would process an estimated 15 thousand (15,000) web pages per week. If there are 100 million web pages in the desired universe, graphically rendering the entire universe of searchable web sites on one computer processor would require approximately 6,600 weeks or nearly 127 years to complete. Even employing 25 different computer systems would require over 5 years to complete a graphical rendering of the desired 100 million web pages.
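
By way of rough illustration only, the following C++ fragment recomputes the comparative timelines just discussed, using the rounded estimates stated above (15 million text-indexed pages per week, 15 thousand rendered pages per week, and a 100 million page universe); the figures are the assumptions of this description, not measured values:

    #include <iostream>

    int main() {
        const double textPagesPerWeek   = 15e6;   // estimate: 25 robots at ~1 page/s
        const double renderPagesPerWeek = 15e3;   // estimate: 1 robot at ~45 s/page
        const double universe           = 100e6;  // desired universe of web pages

        std::cout << "Text-indexing the universe: "
                  << universe / textPagesPerWeek << " weeks\n";
        std::cout << "Rendering on one machine:   "
                  << universe / renderPagesPerWeek << " weeks ("
                  << universe / renderPagesPerWeek / 52.0 << " years)\n";
        std::cout << "Rendering on 25 machines:   "
                  << universe / renderPagesPerWeek / 25.0 / 52.0 << " years\n";
        return 0;
    }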

According to an aspect of the present invention, at least two independent types of intelligent web agents are cooperatively operated to handle different aspects of the job of retrieving, rendering, and processing websites, in a manner that makes it possible to produce graphic data in the form of a compressed or reduced graphic file representing the appearance of a rendered website, and to do so at an acceptable rate. The first type of intelligent web agent (now to be referred to as "Web Agent A") requests, retrieves, and downloads each and every file associated with a particular website, including but not limited to the source code text file, graphic files (e.g., GIFs, JPGs and others), script files, Java executable files, Flash technology files, Shockwave files, animations, and so forth. Web Agent A is arranged to communicate or pass data into one or more memory buffers or queues accessible to a second type of intelligent web agent (now to be referred to as "Web Agent B"), which siphons out the website files as needed to produce and render complete graphical displays of the web page.

The rendering process by Web Agent B comprises processing the text and html tagging data to prepare a visual representation. All the files necessary to render the image have preferably been obtained by Web Agent A before then, and such files are stored in the buffer. Web Agent B produces a full visual representation, such as a bitmap file containing a pixel data array, which if coupled to a display driver could be used to display the web page layout on the video monitor at full screen dimensions. In short, Web Agent B prepares a visual image as might be provided by a browser.
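
The buffer hand-off between the two agent types can be sketched minimally in C++ as follows: several fetcher threads standing in for Web Agents of type A deposit completed page bundles into a shared queue, and a single thread standing in for Web Agent B drains the queue one page at a time. The structure names and the placeholder fetch and render bodies are illustrative assumptions, not the actual implementation:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    struct PageBundle {                        // everything Agent B needs for one page
        std::string url;
        std::string html;                      // source code text
        std::vector<std::string> assetPaths;   // locally stored graphic files
    };

    std::queue<PageBundle> g_queue;            // buffer filled by Agent A, drained by Agent B
    std::mutex g_m;
    std::condition_variable g_cv;
    bool g_done = false;

    void agentA(int id) {                      // one of many concurrent fetchers
        for (int i = 0; i < 3; ++i) {
            PageBundle b{"http://example" + std::to_string(id) + "-" + std::to_string(i) + ".com",
                         "<html>...</html>", {"/cache/img1.gif"}};
            std::lock_guard<std::mutex> lk(g_m);
            g_queue.push(std::move(b));        // a page is queued only once fully downloaded
            g_cv.notify_one();
        }
    }

    void agentB() {                            // the single foreground renderer
        std::unique_lock<std::mutex> lk(g_m);
        for (;;) {
            g_cv.wait(lk, [] { return !g_queue.empty() || g_done; });
            if (g_queue.empty()) break;
            PageBundle b = std::move(g_queue.front());
            g_queue.pop();
            lk.unlock();
            std::cout << "rendering snapshot for " << b.url << "\n";  // render and capture here
            lk.lock();
        }
    }

    int main() {
        std::vector<std::thread> fetchers;
        for (int i = 0; i < 4; ++i) fetchers.emplace_back(agentA, i);
        std::thread renderer(agentB);
        for (auto& t : fetchers) t.join();
        { std::lock_guard<std::mutex> lk(g_m); g_done = true; }
        g_cv.notify_one();
        renderer.join();
        return 0;
    }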

The visual display of the web page is then compressed by Web Agent B, or a process associated with it, to a predetermined and preferably small image size, for example a 2 in.×2 in. image on a 17 inch diagonally measured display screen. This process may involve sampling or local area averaging techniques as known in the art. The reduced size bitmap image then is digitally compressed and/or encoded to minimize storage requirements and to permit quick transmission over an ASCII-only data channel. The reduced size bitmap image can be converted into a JPG, GIF or similar format for an image file suitable for web transmission. That image file, which represents the rendered appearance of the associated web page at a particular point in time, is stored in a mass memory accessible to the search engine. The mass memory can be in one or more hard drives, RAM caches, writable CD ROMs or other media that is useful as a high capacity RAM. The mass memory can be a peripheral on the search system or can be accessible to the search engine, for example using communications over a local area network, provided that the image files are very quickly recallable using a minimum of data communications and/or communications that are direct rather than over the web.
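
The local area averaging mentioned above can be sketched in C++ as follows; each output pixel is simply the mean of a block of source pixels. This is an illustrative reduction step only: the capture of the bitmap and the subsequent GIF or JPG encoding are assumed to be performed elsewhere.

    #include <cstddef>
    #include <vector>

    struct Rgb { unsigned char r, g, b; };

    // Shrink a w x h image by an integer factor using block averaging.
    std::vector<Rgb> shrink(const std::vector<Rgb>& src, int w, int h, int factor) {
        int ow = w / factor, oh = h / factor;
        std::vector<Rgb> out(static_cast<std::size_t>(ow) * oh);
        for (int oy = 0; oy < oh; ++oy) {
            for (int ox = 0; ox < ow; ++ox) {
                long r = 0, g = 0, b = 0;
                for (int dy = 0; dy < factor; ++dy)
                    for (int dx = 0; dx < factor; ++dx) {
                        const Rgb& p = src[(oy * factor + dy) * w + (ox * factor + dx)];
                        r += p.r; g += p.g; b += p.b;
                    }
                long n = static_cast<long>(factor) * factor;
                out[oy * ow + ox] = { static_cast<unsigned char>(r / n),
                                      static_cast<unsigned char>(g / n),
                                      static_cast<unsigned char>(b / n) };
            }
        }
        return out;
    }

    int main() {
        std::vector<Rgb> screen(1024 * 768, Rgb{200, 200, 200}); // stand-in for a captured display
        std::vector<Rgb> thumb = shrink(screen, 1024, 768, 8);   // yields a 128 x 96 thumbnail
        return thumb.empty() ? 1 : 0;
    }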

The mass memory can have a subdirectory naming system and file naming system based on the network addresses or URLs of the web pages from which the graphic files were generated, or alternatively the files can be arbitrarily named or stored and can be found using a cross reference table in the search engine whereby the address or URL of the web page and its associated image file are cross referenced.
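
Both naming schemes can be sketched briefly in C++ as follows; the directory layout, file extension and helper names here are hypothetical, serving only to illustrate deriving a name from the URL versus recording an arbitrary name in a cross reference table:

    #include <cctype>
    #include <iostream>
    #include <map>
    #include <string>

    // Scheme 1: derive a file system friendly name from the URL itself.
    std::string nameFromUrl(const std::string& url) {
        std::string name;
        for (char c : url)
            name += (std::isalnum(static_cast<unsigned char>(c)) ? c : '_');
        return "/snapshots/" + name + ".gif";
    }

    // Scheme 2: arbitrary names recorded in a URL -> file cross reference table.
    std::map<std::string, std::string> g_crossRef;
    std::string nameFromTable(const std::string& url) {
        static int counter = 0;
        auto it = g_crossRef.find(url);
        if (it != g_crossRef.end()) return it->second;
        std::string name = "/snapshots/snap" + std::to_string(counter++) + ".gif";
        g_crossRef[url] = name;
        return name;
    }

    int main() {
        std::string url = "http://www.example.com/index.html";
        std::cout << nameFromUrl(url) << "\n" << nameFromTable(url) << "\n";
        return 0;
    }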

The search engine memory also comprises text indexing data or human categorization directory data (or both), that is obtained in a conventional web crawler manner and includes an association between the text data found at each web page and the web address or URL of the originating web page. In this way, the text indexed or categorized data, and the graphic file location, are both indexed to the URL. By selecting a URL, the search engine can call up the graphic file representing its appearance when rendered at some time in the past. After receiving a selection containing one or more text strings, Boolean combinations, file extension types or other criteria, the search engine can determine the matching web pages, report their URLs and provide a graphic file showing a miniature window version of how they would have appeared if loaded by a browser at substantially the time when their data was loaded and indexed.

Web Agent B preferably has additional functions, including keeping status information such as storing log files containing addresses and/or linked file names that have been attempted and obtained, optionally including a queue of files that presented problems when first tried and should be re-tried or after a time will be rendered with missing-graphic gaps, web addresses that have been completely rendered, etc. Preferably the logs and status indicators are sufficient to permit an operator to monitor operation by reference to readouts or by displaying stored data. Web Agent B also preferably generates error messages and/or alarms in the event of any crucial errors. Status readouts available can include rudimentary data such as the current URL being processed, the rendering state of the current URL, the number of URLs processed since inception or last clear, any error messages and so on.

The search engine can comprise one or a number of processors, and the processors can be in direct communication or linked on a local network or other arrangements, the key being quick access to the stored database of data representing the universe of web pages that have been processed and therefore are searchable. The search engine accepts user search criteria in a conventional way, such as using CGI form boxes to enter text strings into an associated search engine entry html page that is addressable by a browser. The search engine permits selections to be made according to at least one search criterion and preferably accepts a variety of different criteria types and combinations. These aspects of the search engine can be of the type conventionally used by current search engines such as Hotbot, Yahoo, Alta Vista, Northern Light, etc. The search engine is operable to select web page hits as a function of user supplied search criteria and to determine the URL addresses of web pages (hits) that wholly or partly meet the criteria. In addition to determining the URLs of hits, the search engine may store and retrieve a brief exemplary text string such as the initial few lines of text in the web page hit.

The search engine reports search results to the user that entered the search criteria, by composing an html source page and transmitting it to the user. This html report page may identify no hits or a long list of hits, depending on the search results. In composing the report page, the search engine typically shows the search criteria used, and displays indicia summarizing or similarly identifying each web page hit. For example, the search report can identify hits by the URL of the originating web page. Preferably a short text selection such as the first few lines of text is shown. The html coded report page prepared by the search engine includes an associated hyperlink to the URL of each hit. The URL can be shown in plain text and provided with an associated hypertext link (href=[URL]). The user reviews the URLs, sample text or other information and activates the hyperlink of a selected web page identified in the results, thereby loading the web page presently found at the address of the originating page that was processed by the crawler robots.

According to the invention, the composed search report page prepared by the search engine includes but is not limited to the URL of each web page, the title of each web page, a description of each web page, and a graphic depiction of each web page. The user's browser immediately loads the source code, which contains the text portion of the search report. In processing the source, the user's browser encounters the links to the image files that were included by the search engine when composing the search report page and obtains the image file. Preferably the report page composed by the search engine places the graphic for the web page hits immediately adjacent to the associated text and hyperlink. The graphic image was rendered under certain assumptions as to the display configuration and represents a snapshot of the web page frozen in time. The snapshot is at least an approximation of how the web page will appear if the link is activated and the page is loaded by the user (i.e., if the page is unchanged and the user's display configuration is equal to the default configuration assumed by the Agent B of the invention). Unless the page has been substantially changed by its owner, the graphic depiction will substantially assist the user in sifting the pages that are definitely interesting versus possibly interesting, neutral, unlikely to contain pertinent material or definitely irrelevant.

It is an aspect of the invention that the assets and processing power of the search engine system are proportioned to coordinate the operation of Agent A (for fetching) and Agent B (for handling image content), whereby neither one substantially lags the other. Agent A is subject to communication delays involved in requesting, receiving and storing the needed files from the internet, which can delay a single robot, but this is ameliorated in practice by running multiple copies of Agent A in the background. Agent B has more data to process, but due to the preloading by numerous Agents A in the background can process the data quickly from local copies. Agent B is free to monopolize the display in the foreground while multiple Agents A in the background acquire necessary files from the internet and feed them into a temporary data buffer.

In view of the communication delays and to maintain the pace, it is presently preferred that 32 web agents of type A operate in conjunction with each web agent of type B. Thus a plurality of web agents of type A continuously fetch and feed into a buffer or queue all web page files of targeted web pages, including their source code and their graphic images, such as JPG, GIF, Java, Flash, etc., all being stored locally. One or more web agents of type B, preferably one for a number of Agents A (e.g., 32), continuously processes and removes files from this buffer to produce and render one web page snapshot image after another. Concurrently with this process, the text portion of the web page data is indexed or categorized.

The ratio of Agents A to Agents B can be determined from experience such that the contents of the buffer or queue remain substantially stable for the particular search engine. Alternatively, the ratio can be changed on the fly so as to keep Agent B constantly working and to keep the size of the buffer or queue stable. If the queue continues to grow, the ratio of Agents A to Agents B can be reduced, thereby committing more of the available CPU time to Agent B, which should cause the buffer to shrink. The buffer should not be allowed to shrink indefinitely, or Agent B will become idle or will lose efficiency or even stall, waiting for complete sets of web files to become available. Preferably, an optimal buffer size is assigned, such as some hundreds of MBytes. Additionally, this buffer is maintained relatively static by the deletion of data after it is used by Agent B. After startup at an estimated optimum ratio of Agents A to Agents B, additional Agent A processes can be added until a substantial portion of the available communications time is filled with active Agent A messages. If the buffer grows continuously, Agent A processes are reduced in number relative to the number of Agents B, and vice versa. Inasmuch as the optimum ratio depends in part on communication delays caused by web congestion, the ratio of Agents A to B can be varied throughout a processing day.
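
A minimal sketch of such on-the-fly adjustment follows; the 32:1 starting ratio and the buffer target of some hundreds of MBytes follow the description above, while the 25% adjustment band and the one-agent step are illustrative assumptions:

    #include <algorithm>
    #include <cstddef>
    #include <iostream>

    int adjustFetcherCount(int fetchers, std::size_t bufferBytes, std::size_t targetBytes) {
        // Buffer growing past target: too much fetching relative to rendering.
        if (bufferBytes > targetBytes * 5 / 4) return std::max(1, fetchers - 1);
        // Buffer shrinking toward empty: Agent B risks going idle.
        if (bufferBytes < targetBytes * 3 / 4) return fetchers + 1;
        return fetchers;                       // within the band, leave the ratio alone
    }

    int main() {
        int fetchers = 32;                            // preferred starting ratio per Agent B
        std::size_t target = 300u * 1024 * 1024;      // "some hundreds of MBytes"
        std::size_t samples[] = {400u * 1024 * 1024, 150u * 1024 * 1024, 310u * 1024 * 1024};
        for (std::size_t buf : samples) {
            fetchers = adjustFetcherCount(fetchers, buf, target);
            std::cout << "buffer=" << buf << " -> fetchers=" << fetchers << "\n";
        }
        return 0;
    }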

Web Agent B continuously renders and processes web pages one after another according to a specified queue. Web Agent B does not suffer from the limitations and overhead of requesting and transferring files over the internet because these problems are solved by the team of web agents of type A, for example thirty-two of which may be busy addressing and loading files from different sources.

In one embodiment tested, a single web Agent B was employed in a computer engaged as described above. Conventional browser and display driver routines were used to render bitmap display files from html pages that had been revised such that all included image reference links pointed to graphics files that had been previously downloaded by one of the plurality of operating Agent A processes and stored in the queue or buffer, namely on the system hard drive. An image conversion utility then converted the display bitmaps into GIF image files under file names referenced to the corresponding URL of the originating web page. This arrangement proved to be an efficient and fast method to obtain snapshot renderings of web pages. Web Agent B in such an arrangement controls and manipulates all processing and system resources for graphical display but is not held back by the delay of retrieving and storing of the necessary files, which is collectively performed by all the Web Agents of type A, running as concurrent processes in the background and thus not requiring many of the system resources, including the display buffers and drivers. The system proved efficiently capable of rendering at least one web page per second, and if run continuously would render 86,400 pages per day, or 604,800 per week. This may seem like an adequate rate, but assuming a desired universe of 100 million pages, a single computer system crawling at that rate would still need approximately 3 years to complete a crawling cycle. During that time, the content of most of the web pages would have been changed. Therefore, the invention is preferably applied running a number of computers operating concurrently. Networking to a common database and running 18 computers concurrently would allow a complete rendering of a desired 100 million web sites every 2 months. It is preferred that such a 2 month cycle be utilized to maintain a fresh and updated database of graphic snapshots.
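
The revision of image reference links to point at locally stored copies can be sketched in C++ as below; the parsing handles only src="..." attributes and the cache paths are hypothetical, so this is an illustration of the idea rather than the code actually used in the tested embodiment:

    #include <iostream>
    #include <map>
    #include <string>

    std::string rewriteImageLinks(std::string html,
                                  const std::map<std::string, std::string>& local) {
        std::size_t pos = 0;
        const std::string key = "src=\"";
        while ((pos = html.find(key, pos)) != std::string::npos) {
            std::size_t start = pos + key.size();
            std::size_t end = html.find('"', start);
            if (end == std::string::npos) break;
            auto it = local.find(html.substr(start, end - start));
            if (it != local.end()) {
                html.replace(start, end - start, it->second);   // point at the stored copy
                end = start + it->second.size();
            }
            pos = end + 1;
        }
        return html;
    }

    int main() {
        std::map<std::string, std::string> local = {
            {"http://www.example.com/logo.gif", "C:\\cache\\example_logo.gif"}};
        std::string page = "<img src=\"http://www.example.com/logo.gif\">";
        std::cout << rewriteImageLinks(page, local) << "\n";
        return 0;
    }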

It is not unusual during an initial attempt to retrieve a web page using a browser, including retrieval of its included or referenced graphic files, that at least one of the files is not successfully transferred. This may be due, for example, to congestion or other factors causing the website server to time out and issue an error message. Sometimes a file is garbled in transmission and this is detected by the receiving browser, which visually marks the displayed page to show that there is a missing file (e.g., a rectangle is placed at the image position with a red "X" indicating that the transfer was unsuccessful or the received file was defective and could not be decoded and/or displayed as an image). In that situation, the browser "refresh" function often can be invoked to make one or more additional tries to retrieve the rest of the webpage, at the user's point/click command.

According to the invention, in such a situation a built-in redundancy deals with damaged or missing files. Web Agent A is responsible for retrieving and storing the graphics files, and all associated files of a particular web page. In so doing, the originating server or an intermediate router may time out or transmit a damaged version of the file. If a file is not received or a received file is defective, which is not infrequent when browsing, Web Agent B of the invention detects that the file is missing or defective (in other words, Web Agent B notices that the necessary file is in fact not in the buffer as it should be). Web Agent B can be arranged to attempt one or more times to retrieve missing files from the address specified in the html source code (i.e., to obtain the graphic file again "live," directly off the web). Preferably, however, if Web Agent B is ready to render a file and one or more graphic files is not found, then Web Agent B can signal one of the Web Agent A processes to attend to fetching the file, and during the delay Web Agent B proceeds to render another file whose component files are all available. With a redundancy or retry capability, the system is likely to successfully render the whole webpage, with all its graphics and all its associated files, more dependably than a browser responsive to live file downloads. In fact, this redundancy brings the success rate to nearly 100%.
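
One way such a completeness check and re-fetch request might look is sketched below in C++; the file paths, queue types and the division of labor between the agents are illustrative assumptions only:

    #include <filesystem>
    #include <queue>
    #include <string>
    #include <vector>

    struct PageJob {
        std::string url;
        std::vector<std::string> requiredFiles;   // paths Agent A should have stored locally
    };

    bool readyToRender(const PageJob& job) {
        for (const std::string& f : job.requiredFiles)
            if (!std::filesystem::exists(f)) return false;
        return true;
    }

    void schedule(PageJob job, std::queue<PageJob>& renderQueue,
                  std::queue<std::string>& refetchQueue) {
        if (readyToRender(job)) {
            renderQueue.push(std::move(job));     // complete: Agent B can render it next
        } else {
            refetchQueue.push(job.url);           // signal an Agent A to retry this page
        }
    }

    int main() {
        std::queue<PageJob> renderQueue;
        std::queue<std::string> refetchQueue;
        schedule({"http://www.example.com",
                  {"/cache/example/index.html", "/cache/example/logo.gif"}},
                 renderQueue, refetchQueue);
        return 0;
    }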

The respective crawling, communication, indexing, and rendering program functions can be written in any of a variety of available programming languages and can run on any of a number of different platforms. The program has been found to be readily embodied in C++ running on a Windows NT operating system.

It is an aspect of the invention that available communications bandwidth is used efficiently. The multiple Agent A processes operating concurrently are such that the usual reason for waste of communications time, namely waiting for a response from a remote web page server, is minimized because delay experienced by one of the Agent A processes is used by the other Agent A processes that are operating at the same time. The invention can perform on any bandwidth connection, including 28 Kbps. Of course a high bandwidth connection is preferred, such as one or more T1 or T3 connections (if not even higher).

Apart from the example of Windows NT, the Unix platform is alternatively useful according to the invention due to its capability of handling multiple simultaneous processes. The respective software robots can run on the Unix platform as applications programmed, for example, in C, C++, Perl or one of the other languages. To finish crawling cycles reasonably promptly, in a preferred arrangement numerous computers are employed simultaneously, each having its own connection to the internet and each employing its own embodiment of the current invention. The computers can reside on a network and feed off of and simultaneously contribute to a common database maintained by one of the computers on the network.

The two general functions associated with preparing the database of information which is then subject to search and reporting are the functions of retrieving all webpage data (performed by Web Agent A), and generating a "snapshot" file from the data (performed by Web Agent B). It is found that these functions can operate concurrently with or apart from the search engine processor or processors that search the database of information and return results to the requesting user. The preferred embodiment, however, is to perform all processing with regard to rendering, resizing, and compressing the snapshot prior to its being accessible to surfers on the web. A cycle of processing (crawling, indexing, rendering) preferably is completed and the index and snapshot files that result are loaded into a database or are used to update a database, maintained on the server that accepts user search criteria and composes and sends to the user the search results.

Web Agent A attempts sequentially (or randomly or otherwise) to load all the web pages listed in a large database of URL addresses that were compiled previously from various sources. A compilation of URL addresses might be built up by trying to download composed URLs based on dictionary words (e.g., http://www.aardvark.com . . . ) or company names from a name directory (e.g., http://www.acme.com . . . ) or known URLs from a domain name service, or even all sequential string combinations one after another. The tried and true way to compile a list of addresses for a web crawl is to start with URL addresses from an existing compilation of web page addresses, such as a domain name listing; to load each one sequentially; and to scan through the source of the loaded pages for all the hypertext links to other URL domain names and/or URL web page addresses. These latter linked web pages are then added to the compilation of URL addresses, and crawled (loaded and also scanned for links) at some later time.
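
The link harvesting step of that approach can be sketched in C++ as follows; the parsing is deliberately naive (only href="..." attributes with absolute http addresses) and the names are illustrative:

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    void harvestLinks(const std::string& html, std::set<std::string>& seen,
                      std::vector<std::string>& toCrawl) {
        const std::string key = "href=\"";
        std::size_t pos = 0;
        while ((pos = html.find(key, pos)) != std::string::npos) {
            std::size_t start = pos + key.size();
            std::size_t end = html.find('"', start);
            if (end == std::string::npos) break;
            std::string url = html.substr(start, end - start);
            if (url.rfind("http", 0) == 0 && seen.insert(url).second)
                toCrawl.push_back(url);           // new address: queue it for a later crawl
            pos = end + 1;
        }
    }

    int main() {
        std::set<std::string> seen = {"http://www.acme.com/"};
        std::vector<std::string> toCrawl;
        harvestLinks("<a href=\"http://www.acme.com/\"></a>"
                     "<a href=\"http://www.aardvark.com/\"></a>", seen, toCrawl);
        for (const std::string& u : toCrawl) std::cout << u << "\n";  // prints only the new URL
        return 0;
    }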

The search system of the invention preferably permits anyone to suggest a web page to be added to the universe of searchable pages. The suggested web page is added to the compilation, and the search engine's robots crawl the web by loading the suggested page, noting and loading the pages linked to the suggested page and continuing on to the pages that are linked to the linked pages, etc. Duplicates are removed. URLs that have been recently visited can be flagged for deferred reload, or removed.

Another preferred method incorporates the use of a human reviewed and compiled database. A "human surfer" or web page reviewer may be more dependable than a robotic agent in categorizing the content of web sites (e.g., "The Electric Factory" is the identifier of a concert promoter and supplier of tickets to entertainment events). Both methods can be utilized to compile a database of websites. A team of human surfers can be employed for the task, each visiting successive websites and making determinations, for example, as to an appropriate title, description, category or the like. The current invention provides additional enhancement to a human compiled database in that the content of a website is even more quickly apparent if any descriptive terms or titles are considered together with a snapshot of the content, even if miniaturized to the extent that most or all of the text shown in the snapshot may be too small to be readily discerned.

In a preferred arrangement of the invention, the processing is accomplished in a network of programmed processors that are in data communication with one another and each of which has a TCP/IP communication link to the web. The database containing the universe of crawled or to-be-crawled target web sites, which may number in the millions, can be stored in a controlling processor or can be part of a shared data store used to allocate individual URLs to client computers on the search system network, such as by permitting Web Agent A to obtain the next URL from the list and to flag the URL as in use. It is not strictly necessary to use the network paradigm. Instead, each Web Agent A or each client computer running multiple Web Agents of type A can contain its own database with a subset of the URLs of the universe, and the databases of a number of robots or clients can be synchronized periodically to eliminate duplicates, flag URLs after they have been crawled, and be similarly updated. In a typical application, the database serves out a URL to the next Web Agent A in the queue and moves an index or "pointer" to refer to the next URL to be served out.

Web Agent A receives the URL, makes a TCP/IP request for the web page over the web, and attempts to download the source code and all the necessary graphic files and data needed to render that website. Web Agents of type A are preferably programmed to "patiently" request and await download of files, but are also intelligent as to which of the files to ignore (for example audio files are ignored) and whether to continue to attempt downloading if successive attempts have been unsuccessful. Integrity, byte count, parity and similar checks can be performed to ensure that the download is complete and correct.

In dealing with websites containing "frames," which are actually multiple documents that are loaded and displayed in tandem at a defined and potentially variable portion of a browser display screen, each document typically has an end-of-file code and issues a download complete message to the Operating System.

Often a framed web page can accept and display any of a number of other web pages as an inset frame. This complicates matters in that the end-of-file that actually concerns only part of the framed page might erroneously trigger the Web Agent to move on to the next website and to process the frame but not the framed content.

Frames also present a problem for the crawler robot regarding embedded html links to other web pages. The owner of a frames web page can include html links to web pages of others. If a surfing browser attempts to load the linked page by selecting (clicking on) the link on the frames web page, the browser will load the linked page but it will be within the frame of the first web page owner. The browser is not linked independently in that case and instead is linked through the frames page. Thus the html target address that appears in the browser toolbar and is recorded in the browser's history list is not a link to the selected site. Instead it is a link to the frames page, with a modifier that identifies the selected site. When that target address is invoked, the frame is loaded and the linked web page is inserted into the frame.

In queuing embedded links found on pages for processing, Web Agent A distinguishes framed links from direct links. When processing a framed page, preferably, the crawler invokes the framed page's internal links to find and queue additional links, but does not treat every framed link as a new web page. Insofar as Web Agent A encounters websites with frames, it processes the data local to that web site and checks for the presence of a website with frames. When a frame page is detected, the Web Agent A checks for a download complete message (end-of-file) for every framed element and processes the text and graphics of both the frame and its framed contents.

Web Agent A preferably detects dynamic occurrences that are programmed into web sites, from the html source code that is received. Agent A can keep only a portion of the content of a particular file, such as the first frame of an animated GIF, or can wholly ignore the file, such as an audio file, a data entry form script or video clip, etc. There are a variety of situations in which a web site may be arranged to display text or graphics sequentially or conditionally, or to link the user to different files. These include automatic re-routing to a further link after a delay or after a user input such as a mouse click, pop up windows for temporary display of a graphic on top of a background, CGI prompt boxes for entering data, data that varies inherently such as video windows, sound files, animated GIF images and other similar occurrences.

According to an inventive aspect, Web Agent A of the invention deals with changing data by loading as much of the text and graphic data as the target web page will supply, and storing a sufficient collection of the graphics and linked files to prepare a static version of the target page upon initial access. This requires Web Agent A to search the source code received from a site for indications of dynamic content and to suppress the dynamic aspect of the content. However, the dynamic aspect is preferably not omitted entirely, and instead is limited to a static display of the initial content encountered.

Accordingly, sound files (WAV, MID, MP3, etc.) are suppressed and ignored. For example, in downloading the html source, Agent A deletes links, as a function of their file extensions, before storing the file, and of course does not attempt to download the files themselves. Animated graphics preferably are partially loaded (e.g., only the first frame of an animated GIF) or the graphic files are fully loaded by the Web Agent A but are only partly processed by the Web Agent B. Video content can be processed to obtain an initial frame, but preferably video is ignored and is replaced by a link to a static graphic that marks the video and the file type. For example, MOV video files can be marked by a static Apple Quicktime icon, or ASF files marked by a static Windows MediaPlayer icon, etc. The static markers preferably are chosen by file extension (e.g., for video, RAM=RealPlayer, ASF=Windows MediaPlayer, MOV=Quicktime), or a generic marker is used for all these formats, or perhaps only for the generic formats that all the players can process (e.g., MPG). Either Web Agent A or Web Agent B can process the target site to link to or to present the static display marker for such files.
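
A sketch of such an extension-driven policy appears below in C++; the extension lists and icon file names are illustrative assumptions rather than an exhaustive mapping:

    #include <cctype>
    #include <iostream>
    #include <string>

    enum class Action { Ignore, FirstFrameOnly, StaticMarker, Download };

    struct Decision { Action action; std::string markerIcon; };

    Decision classify(const std::string& file) {
        std::size_t dot = file.rfind('.');
        std::string ext = (dot == std::string::npos) ? "" : file.substr(dot + 1);
        for (char& c : ext) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

        if (ext == "wav" || ext == "mid" || ext == "mp3") return {Action::Ignore, ""};
        if (ext == "gif") return {Action::FirstFrameOnly, ""};   // may be animated
        if (ext == "mov") return {Action::StaticMarker, "icons/quicktime.gif"};
        if (ext == "asf") return {Action::StaticMarker, "icons/mediaplayer.gif"};
        if (ext == "ram") return {Action::StaticMarker, "icons/realplayer.gif"};
        return {Action::Download, ""};                            // e.g. JPG, HTML
    }

    int main() {
        const char* files[] = {"intro.mp3", "banner.gif", "clip.mov", "photo.jpg"};
        for (const char* f : files) {
            Decision d = classify(f);
            std::cout << f << " -> action " << static_cast<int>(d.action)
                      << (d.markerIcon.empty() ? std::string() : " (" + d.markerIcon + ")")
                      << "\n";
        }
        return 0;
    }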

Similar markers can be used to indicate the presence of media that is not displayed. For example, an icon or character (e.g., "_") can indicate when a link to an audio file is detected. As in the foregoing discussion of video, the icon also can be chosen as a function of the file extension to indicate the type of audio file found, such as WAV, MID, MP3, etc.

According to further aspects of the invention, pop up windows are ignored or suppressed. Dialog boxes, unlike pop up windows, are somewhat more complex and may obstruct the display of background page features when displayed. A dialog or data-entry CGI box may suspend the processing of a page until the dialog box is handled. Rather than permitting a dialog box such as a name or password box to suspend operation of Web Agent A, a dialog box is detected and triggers running of a "cancel" routine. Assuming that the site is operating password control or a similar process, that process is discontinued for failure to enter the password or the like, but Web Agent A can continue on and may obtain additional graphic file data or text after the dialog box or similar prompt has been passed.

Animated GIFs and other changing features can also be identified by an icon indicating the presence of that feature. Preferably these animated features are selectively processed to provide a static image. Animated GIFs and some other technologies, such as Macromedia Flash, provide an action sequence in the form of a plurality of images that are displayed in quick succession, normally in a loop. It is a problem with animations, especially those pertaining to Macromedia Flash Technology, to select which frame will be captured or selected as representative of the animation. Animated GIFs begin with a graphic and the subsequent "frames" may be limited only to those pixels that have changed color from one frame to the next. Flash Technology usually begins with a blank screen or blank square. Choosing the first frame of a Flash movie as the designated frame to process and render would certainly be unacceptable. According to alternative solutions, the Web Agent B can employ a timer to wait a predetermined time before capturing the rendered image in a file of the type that starts as a blank or fades in. It may be a matter of luck what in particular will be present at the moment captured in the changing portion of the display. An alternative is to generate a static image as a sum or average of two or more changing frames, which may produce a smeared static image. Another alternative is to disable the Flash plug-in by a suitable message to the target site when loading the page. Disabling the Flash plug-in may eliminate any graphic data, namely if the website operators did not provide a static HTML page as an alternative to be presented for users who are not outfitted for Flash. Often, a user without Flash is presented with a blank screen with a tiny caption at the bottom reading "If you do not have Flash, click here." A rendering and subsequent snapshot of a screen similar to this could be misleading to the user if viewed within the search results of a search engine, so a timed capture is preferred.

It is an aspect of the current invention to provide an icon or similar indication within the search results as to whether or not a particular website contains Flash Technology. This alleviates possible inconsistencies in processing and rendering a Flash movie, and subsequent interpretation by the user of a search engine who may be viewing the snapshots. Moreover, for Flash and similar technologies that are optional for users, adding an indication of their presence benefits users of the search results. Specifically in the case of Flash, a user who has loaded the Flash plugin or otherwise has the capability to process the content will prefer to access pages that contain Flash content if other factors are equal. Users with browsers incapable of processing Flash technology might be forewarned that their browser may have difficulty rendering that particular website, or at the least would be neutral about that aspect of the web site. The use of Flash, RealAudio and other "value added" technologies is often an indication that a particular website has superior content.

Therefore, in a preferred embodiment, the presence of Flash content is detected. A static page is captured according to one or more of the foregoing alternatives, preferably by disabling the Flash Plug-in. A conventional static graphic is displayed in the snapshot image, and adjacent to the static graphic an icon is inserted to show that the site is a Flash site. The same technique can be used to identify other dynamic displays, such as Shockwave Movies and the like, preferably using distinct icons for each type.

In the preferred arrangement shown in FIGS. 1-3, each computer employed by the search engine system has one database, a plurality of Web Agent A's, and a single Web Agent of type B. While the Web Agent A's are occupied with downloading necessary text data and graphic files in the background, the single Web Agent B is busy in the foreground rendering pages and performing coordinate based screen captures. Most commonly, screen captures are performed at a bit depth or resolution of 24 bits, and thus comprise 16.7 million possible colors in the captured image. To minimize data overhead and to maximize efficiency, a coordinate based system is utilized to execute the desired image capture.

By operating Web Agent B in the foreground, the invention can take advantage of certain display facilities without corresponding processing overhead. Such facilities may include, as available, display processing hardware, software, firmware, coprocessors, memory caches and possibly peripherals such as display driver cards, which might normally be used to facilitate fast updates to a display during the foreground operation of a program.

According to an inventive aspect, the system as described can be configured to operate using a plurality of independent computers that are in data communication (e.g., on a common network, or having access to a particular memory store either concurrently or by virtue of preparing a mass memory media such as CD ROMs containing the storage media database, using one or more computers, and then processing the database for searches using one or more additional computers). In one particular arrangement, one computer (or a subset of a group of computers) exclusively runs Web Agent A processes, for downloading data, files, media, graphics, etc. This large number of Web Agents of type A, or processes incorporating similar capabilities, deposits downloaded files into a datastore, such as a hard drive, removable drive, or the like. The deposited data can then be transferred via network or on movable media to a different computer running a Web Agent of type B.

Web Agent B processes the data to provide reduced/compressed web page images or snapshots as graphic data files. This second computer running Web Agent B accesses the datastore to render and process websites according to a specified queue. The point is that it is not necessary to have both types of Web Agents on the same computer to enable proper execution of the system, and it may be efficient to separate these functions as described. The index preparation function, in which the storage database is processed to ready it for searching, and the searching and reporting functions, in which user queries are accepted, the storage database is searched and a report is composed and returned, can likewise be separated onto additional computers that each serve particular functions. In this way, operating together and preferably including allocation of additional resources at any processing and communication bottlenecks, the system can obtain data, prepare the data for searching by preprocessing the data, including producing graphic image files, and conduct and report searches via interaction with remote users.

In a preferred form of the invention, both types of Web Agents run on the same machine. When one of the Web Agents of type A downloads a web page, it stores all elements of the page, both text and graphics and including files that may be linked to each subject page but stored at a different server address, and saves the URL address and the associated file names. The URL is added to Web Agent B's input queue. All of the Web Agents of type A perform this same process, namely attempting downloads and, when a download is complete, placing that URL in the Web Agent B's queue. In this way the Web Agent B normally cannot outpace the Web Agents of type A, even though the latter are occupied to some extent with waiting for transmitted data to be sent by a remote website server.

The Web Agent B undertakes processing after all the files necessary to complete the processing have been downloaded and accessibly stored. For example, Web Agent A or Web Agent B (or another process, such as a process that parses the received source code for text indexing) scans through the source code and thereby determines the files that are needed for processing, namely the files or addresses to which hyperlinks are found in the source code. In an embodiment in which Web Agent A handles this process, the web page can wait to be queued for Web Agent B until Web Agent A has successfully loaded and stored all the files. Alternatively, a list of the associated files can be prepared by Web Agent B or by another process with access to the source code, and Web Agent B can check the list before attempting to process the data for the web page. In any event, preferably the processing capacity of Web Agent B is substantially devoted to processing pages that are complete when their processing commences.

Web Agent A, or another process, can be arranged to continue to attempt to load any of the necessary files that has not been loaded. Alternatively, Web Agent B can undertake a new communication on the web in an effort to retrieve the missing file, or can queue one of the Web Agents of type A, or another process, to obtain the file or to reload the missing file(s) or perhaps the entire web page and associated files. Reloading the entire web page deals with the possibility that a file that is found to be missing or unavailable may no longer be linked in the web page source code, and thus is unnecessary. Discontinuance of the link might also be the reason why the file has not been found (i.e., it was removed and deleted).

Performing from locally stored text and data files, Web Agent B can render and capture graphic image files or snapshots at an improved rate. In an embodiment wherein one Agent B and several Agent A processes were operative on one computer to accumulate stored files locally and to queue Agent B, Agent B was found able to produce graphic snapshot files at a rate of about one web page per second. This is much faster than downloading and rendering one page at a time, as would be the case with a normal browser, wherein transmission waits make the typical rate one web page per 45 seconds.

The rendered image file is captured from the display buffer memory of the operating system, and then is resized, processed to increase image quality, and compressed. It is then stored on disk in a standard format under a file name associated with the URL of the originating web page.

Upon completion of a full crawl, rendering of each and every desired web site, and full data storage of the resulting graphic snapshots, the search engine database is ready to accept user queries. The user presents combinations of text string expressions in a known manner. According to the same sort of search criteria known in other search engine applications (e.g., HotBot, Alta Vista, Yahoo, etc.), the criteria are compared to the indexed text information. By whatever means used (e.g., all words, any word, exact phrase, Boolean combinations, with or without results ranking or categorization, etc.) the search engine selects and prepares a list of the web page hits discovered by comparing the search criteria to the contents of the indexed database.

A report listing is prepared by generating a reporting web page in html source code, which is then sent to the user. The reporting web page includes a list of hits wherein each entry on the list comprises an html link to the URL from which the associated web page was downloaded. Preferably, and as already done with most search engines, the entries also include at least a line or two of text from the web page, such as the first three lines. Additionally, according to the invention the entry also has an html link to the graphic file on the search engine computer where the snapshot of the rendered web page is stored. This link can be an IMG SRC=[path] [filename] command.
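
One entry of such a report page might be composed along the following lines; the markup and the helper name are merely illustrative of combining the URL hyperlink, a few lines of sample text, and an IMG SRC reference to the stored snapshot (the snapshot itself being linked redundantly to the same URL):

    #include <iostream>
    #include <string>

    std::string composeHitEntry(const std::string& url, const std::string& sampleText,
                                const std::string& snapshotPath) {
        return "<p><a href=\"" + url + "\">" + url + "</a><br>" +
               sampleText + "<br>" +
               "<a href=\"" + url + "\"><img src=\"" + snapshotPath + "\"></a></p>\n";
    }

    int main() {
        std::cout << composeHitEntry("http://www.example.com/",
                                     "First few lines of text from the page...",
                                     "/snapshots/example_com.gif");
        return 0;
    }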

When the user reviews the search report using a browser, the browser inserts the graphic snapshot image adjacent to the listing of the URL link to the subject web page. Thus the user can determine whether a page entry in the search results is of interest, not only from the text information included with the URL link such as a description and title, but also from a small size presentation of what the web page looked like when it was indexed.

If the user is interested in reviewing the web page to which the search report entry is directed, the user can click on the hypertext link to the URL of the web page, whereupon the user's browser loads the web page directly from its original web page server. The snapshot image preferably is associated with the hypertext link redundantly, so that the user can click either on the hypertext link or on the snapshot image and in either case will be linked by the URL to the originating web page.

There are some timing issues. Between the time that the web page was downloaded and the time that the user clicks on a search result entry to review the page, the contents of the page may have changed. If a website operator updated or changed the layout of that website since it was rendered and processed by the snapshot software (Web Agent A and Web Agent B), it is possible that the visual aspect as seen through the user's browser no longer coincides with the snapshot image in the search results. Nevertheless, the snapshot normally shows a mostly consistent visual representation of the current content of the web page.

Numerous algorithms were tested to generate the ideal snapshot from the raw image data (effectively to convert a bitmap image in the display memory of the computer to a GIF or JPG file to be stored on the disk of the search engine computer). It is essential to the performance of the system to obtain a high image quality and a small file size. However, these two objectives tend to contradict each other: under normal circumstances one or the other can be had, but not both. The higher the image quality, the larger the file size, and consequently the longer a user of the search engine has to wait for the snapshot to download. On the other hand, creating a small file with less data will result in a faster download for the user but will also result in poor, unacceptable image quality as it pertains to the snapshot. Not all algorithms are programmed the same, and in fact, some are found to be superior to others.

The algorithms necessary to control resizing, image quality, and compression are programmatically controlled to create the resultant graphic snapshots. To provide a perpetual, never ending crawl and graphical rendering of websites on the internet, it is necessary to automate all functions, including those found in commercial software, so that they may be performed without human intervention. Web Agent B, upon complete rendering of a web page, programmatically manipulates the aforementioned algorithms and subsequently ensures the proper storage of the resulting graphic snapshot onto disk. Additionally, Web Agent B performs a test to determine whether the graphic snapshot is of a higher quality in GIF format or JPG format. It should be noted that new algorithms or other existing algorithms may be operable and may be preferable in other operating situations.
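
The format test can be sketched in C++ as follows; the encoder functions here are empty stand-ins for real bitmap-to-GIF and bitmap-to-JPG conversion routines, and the selection rule (keep the smaller file of acceptable quality) is an assumption for illustration:

    #include <string>
    #include <vector>

    struct Encoded { std::string format; std::vector<unsigned char> bytes; double quality; };

    // Placeholders for real conversion code (e.g. commercial library subroutines).
    Encoded encodeGif(const std::vector<unsigned char>& bitmap) { return {"gif", bitmap, 0.9}; }
    Encoded encodeJpg(const std::vector<unsigned char>& bitmap) { return {"jpg", bitmap, 0.8}; }

    // Keep whichever result is of acceptable quality and smaller on disk.
    Encoded chooseSnapshotFormat(const std::vector<unsigned char>& bitmap, double minQuality) {
        Encoded gif = encodeGif(bitmap);
        Encoded jpg = encodeJpg(bitmap);
        if (gif.quality < minQuality) return jpg;
        if (jpg.quality < minQuality) return gif;
        return gif.bytes.size() <= jpg.bytes.size() ? gif : jpg;
    }

    int main() {
        std::vector<unsigned char> bitmap(128 * 96 * 3, 0);  // stand-in for a captured thumbnail
        Encoded best = chooseSnapshotFormat(bitmap, 0.75);
        return best.bytes.empty() ? 1 : 0;
    }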

Obviously, individually resizing, sharpening, compressing, and converting each and every bitmap screen capture manually to produce the desired representative snapshot would prove prohibitive. A means of automating and accelerating this process is therefore warranted. An element of the snapshot software system is to programmatically control software to perform these actions, for example by manipulation of subroutines from commercial software. This can be accomplished using C++ programming to access certain files and processes normally regarded as internal to the computer operating system. In particular, the memory locations containing the bitmap image intended for the browser display, generated by the operating system (e.g., browser, display drivers, etc.), are co-opted and used as the source file for generation of a compressed graphic image file in an efficient format for storage and data transmission. In particular, a bitmap-to-GIF or bitmap-to-JPG conversion is effected on the contents of the display buffer stored in RAM. Exactly which conversion is used is determined by Web Agent B.

Upon the completion of processing the original bitmap screen capture into a snapshot, all the raw data files used to render the image now captured are deleted to prevent the data store from overfilling. That is, the original html source code can be deleted together with the graphic files addressed in the source code and downloaded for inclusion in the rendering, leaving in storage only the representation of the web site in the database by its URL address, its text indexing and/or categorization and the ultimate graphic snapshot in an image file cross referenced to the stored address. The bit size of the graphic snapshot file is approximately 1/200 the size of the original data. By automatically and continuously deleting raw files from the data store after processing, the necessary data for searching and reporting with a rendered depiction remains available (albeit a small depiction), and storage capacity requirements remain manageable.

A popular search engine or website portal may receive numerous server requests per day. The search engine opening page may have numerous included text, graphic and interactive elements, each element requiring a communication request for transmission of a data file. Top search engines are visited by millions of users every day, and each search can generate numerous "hits". Some of the search portals personalize the presentation to users. If a search engine is visited by millions of users per day, it has to serve multiple millions of operations and data transfers. According to the present invention, the search engine can report improved information without the corresponding overhead.

Search engine visitors are very impatient, and tests show that they are not willing to wait very long for results to be reported. The invention expedites the search and search reporting process while improving the content of the results, and encourages users to remain loyal to their preferred search engine.

Conventional search engines report results in a format that, apart from advertising and preset information, is limited to text and text formatted as links to the URL addresses of pages in the hit list. This textual form can be reported in a very small file size as compared to the number of hits reported, thereby limiting server overhead and decreasing internet download time. If a browser were arranged to attempt to load and render snapshots during the receipt and display of a search report, a serious technical challenge and communication load would result.

According to a further aspect of the present invention, the snapshot rendering feature is preferably enabled and disabled by user option, which can be a point upon which the user personalizes his/her access to the web portal containing the search engine of the invention. For this purpose, the user can be assigned a code that is stored in a cookie that is sent to or made available to the portal, or a cookie containing bit flags in which the user can set and unset options such as snapshot reporting.

It is a further aspect of the invention to employ a system comprising a plurality of optimized "snapshot servers". The snapshot servers access and deliver graphic snapshots from storage, to the network address of a user to whom a search is being reported. The snapshot servers can conduct packet data transmission, serve requests for re-send, etc. The snapshot servers remove the overhead of reporting graphic files from the searching processes, and thus ensure that search reports are as quick as possible. The search process passes the graphic file names and user network address to the snapshot servers. The snapshot servers transmit the graphic snapshot files corresponding to each search report following shortly after the search report text.

In a preferred embodiment, the snapshot servers employ a RAM cache for storage of some or all of the snapshot images to be reported to users. This contributes further to the reporting speed because it is not necessary to await the addressing and loading of image snapshot files from the system hard disk, and the hard disk does not become an undue bottleneck. Upon system startup the library of quick access snapshot graphics can be copied from a hard drive into the RAM cache. The cached files can be all of the snapshot graphic files or only those found after experience to be most frequently addressed. The snapshot servers preferably share or employ a large cache, for example at least 1 Gigabyte and preferably 10 Gigabytes or greater.

For determining the frequency of addressing, the snapshot servers preferably contain a program or process that counts or calculates the two million most requested snapshots. This can be updated on a weekly basis. Although any number of snapshots could be maintained within rapid access of the search engine's database, a predetermined number of those found to be most requested, such as two million, are kept directly in the memory cache (hence a cache size of 10 Gigabytes, or approximately 5 KBytes per image). The status of a given page as being among the predetermined number (e.g., two million) that are most often requested, or at least most often reported in searches, can be indicated in the graphic results, for example by adding a frame to the snapshot that is reported by transmitting an additional frame graphic.
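
A reduced sketch of such a most-requested cache follows in C++; the counting and the periodic rebuild mirror the description above, while the container choices and the tiny cache size in the example are illustrative (the text contemplates on the order of two million entries in roughly 10 Gigabytes):

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    std::map<std::string, long> g_requestCount;          // URL -> times requested
    std::map<std::string, std::string> g_ramCache;       // URL -> cached snapshot data

    void recordRequest(const std::string& url) { ++g_requestCount[url]; }

    // Periodically (e.g. weekly) rebuild the cache from the top-N request counts.
    void rebuildCache(std::size_t topN) {
        std::vector<std::pair<long, std::string>> ranked;
        for (const auto& kv : g_requestCount) ranked.push_back({kv.second, kv.first});
        std::sort(ranked.rbegin(), ranked.rend());        // most requested first
        g_ramCache.clear();
        for (std::size_t i = 0; i < topN && i < ranked.size(); ++i)
            g_ramCache[ranked[i].second] = "<image bytes loaded from disk>";
    }

    int main() {
        recordRequest("http://a.example/"); recordRequest("http://a.example/");
        recordRequest("http://b.example/");
        rebuildCache(1);                                  // keep only the single busiest entry
        std::cout << "cached: " << g_ramCache.size() << " snapshot(s)\n";
        return 0;
    }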

In a preferred embodiment, the textual portion of search results always is sent and caused to appear first, prior to the snapshots corresponding to those results. As a result, regardless of whether the user has turned the snapshots capability "ON" or "OFF", the text portion appears first. If a user so desires, he can abort the transmission of the results based on review of the initially received portion. This is accomplished through programming within the snapshot server system that queues the text portion of the search results to be "released" or transmitted first, preferably even before addressing (or perhaps even checking for the presence of) the corresponding snapshots.

A number of additional variations and further embodiments are possible and will become apparent to persons skilled in the art in view of this disclosure. The invention is not intended as limited to the precise arrangements disclosed as examples. Accordingly, reference should be made to the appended claims for assessing the scope of exclusive rights claimed.

I claim:
1. A method for processing data files stored at distributed addresses on a data processing network, at least some of the data files having text and graphic content, the method comprising: analyzing at least a subset of the data files to produce a database of information characterizing aspects of the data files that tend to distinguish the data files from one another, and cross referencing said information to addresses of the data files; generating an image of at least a portion of the subset of data files, and storing a graphic file of said image in a manner cross referenced to the addresses of the data files, whereby the graphic file represents an image of the data files at a time of generation; receiving search queries and applying the search queries to the database for selecting a hit list from among the data files; reporting the hit list in a search report including the addresses of each of the data files selected and the image corresponding to the data files in the hit list at the respective time of generation.
2. The method of claim 1, wherein the data files comprise hypertext markup language text and linked graphic format files, and wherein said analyzing comprises at least one of indexing the text and reviewing at least a portion of the data files for assignment of an arbitrary categorization.
3. The method of claim 2, wherein the data files comprise hypertext markup language text and linked graphic format files on one of an intranet and the World Wide Web, and wherein said generating comprises rendering an image corresponding to the data files according to a predetermined display configuration defining a default choice of at least one of a pixel display size, font type, color palette, color resolution and use of colors.
4. The method of claim 2, wherein said analyzing and said generating are accomplished using at least two processes, one of said processes collecting the hypertext markup language text and the linked graphic format files and another of said processes rendering the graphic files as a presentation of respective said data files.
5. The method of claim 4, comprising a greater number of said processes collecting the files than a number of said processes rendering the presentation.
6. The method of claim 4, comprising storing in a buffer each of the files collected by said processes collecting the files, queuing the process for generating the image, and deleting the files in the buffer after generating the image.
7. The method of claim 4, comprising operating said process rendering the presentation file using at least part of a computer's display facility to produce a bitmap, and converting the bitmap into a graphic format file.
8. The method of claim 4, comprising operating said process rendering the presentation file by reducing a display size of the bitmap and converting the bitmap into a graphic format file.
9. The method of claim 2, wherein said reporting of the hit list comprises composing a hypertext report page containing selectable links for addressing corresponding said data files, and transmitting the hypertext report page to a user submitting a query, and wherein the report page additionally includes an image link addressing the graphic file for at least a portion of the hits.
10. The method of claim 1, further comprising stripping at least one variable aspect of the data files, said aspect comprising at least one of a time changing display feature, a user interactive feature and a nonvisual media feature.
11. A network search engine for managing user selection of information contained on data files stored at distributed network addresses on a global information processing network wherein distributed users have control over associated data files accessible by other users, each of said data files having at least some associated text and each of the data files having at least one mode of graphic presentation, comprising: a crawler having at least one processor operable to address and load successive data files comprising at least a subset of said data files stored at said distributed network addresses, the crawler being operable to produce and store a database of information characterizing aspects of the data files that tend to distinguish the data files from one another, cross referenced to addresses of the data files; and, wherein the crawler is further operable to produce graphic image files representing at least some of the data files, the graphic image files each corresponding to content of corresponding said data files at a point in time, and wherein the crawler is operable to store the graphic image file so as to cross reference the graphic image file to the data files in the database.
12. The network search engine of claim 11, further comprising programmed processes operable to receive search queries from network users, to apply the search queries to the database for selecting a hit list from among the data files and to report the hit list in a search report including the addresses of each of the data files selected and the image corresponding to the data files in the hit list at the respective time of generation.
13. The network search engine of claim 11, wherein the data files comprise hypertext markup language text and linked graphic format files, and wherein said analyzing comprises at least one of: indexing the text for storing a text index cross referenced to network addresses of the data files; and reviewing at least a portion of the data files for assignment of an arbitrary categorization and for storing a categorization cross referenced to the network addresses of the data files.
14. The network search engine of claim 11, wherein the crawler operates at least two discrete processes for collecting the data files and files linked thereto, and for producing the graphic image files.
15. The network search engine of claim 14, wherein the discrete processes operate together on at least one processor, and wherein the processes for collecting are more numerous than at least one said process for producing the graphic image files.
16. The network search engine of claim 15, wherein the process for producing the graphic image file renders an image of the data files from downloaded copies of the hypertext markup language text and linked graphic format files, and converts a resulting display image file into the graphic format file.
17. The network search engine of claim 16, wherein the process for producing the graphic image file renders the image of the data files according to a configuration selected as a default configuration with respect to at least one of use of changing visual features, presentation of user interactive features, presentation of non-visual media, display pixel resolution, color palette, color resolution, and use of colors.
18. The network search engine of claim 17, further comprising a programmed process for producing the graphic image file, which co-opts a display bitmap from a processor programmed to present the data files, and converts the bitmap to a compressed graphic format file stored on the search engine.
19. An improved Internet search engine for managing user search and selection of web pages stored at distributed systems coupled at network addresses to the Internet, the search engine having an associated web crawler operable to address and load successive web pages, and to index text data associated with said successive web pages so as to obtain parameter information that distinguishes at least groups of the web pages from one another, the crawler storing the parameter information and associated addresses of the web pages, and the search engine being operable responsive to user submitted search criteria to search the parameter information and to report at least the associated addresses of web pages that met the search criteria when indexed, wherein the improvement comprises: said crawler being operable in conjunction with obtaining the parameter information for at least a subset of said successive web pages to generate a graphic image file containing a visual image that is substantially identical to an appearance of said web pages, for display in a size proportionally smaller than said web pages; and wherein the search engine is operable when reporting the associated addresses of web pages that met the search criteria to include a representation of the graphic image file in said proportionally smaller size.
20. The improved Internet search engine of claim 19, wherein the crawler generates the graphic image file with an appearance of the web pages according to a predetermined default display configuration of a browser.
21. The improved Internet search engine of claim 20 wherein the predetermined default display configuration defines a selection of at least one of relative font size and type, colors and pixel aspect ratio.
22. The improved Internet search engine of claim 21, wherein the search engine reports to the user the associated addresses of the web pages that met the search criteria, in a form of hypertext source data containing URL links to said web pages, and wherein the graphic image file is displayed in association with a URL link to the web page represented by the graphic image file.
23. The improved Internet search engine of claim 21, wherein the graphic image file comprises a compressed pixel image of a bitmap corresponding to said web pages.
24. The improved Internet search engine of claim 22, wherein the graphic image file is transmitted as an image link in the hypertext source data to a file compressed by at least one of MIME, Binhex and Base64.
25. A system comprising: a fetching agent configured to receive a website file via at least one network interface, wherein the website file is associated with a web page; a rendering agent configured to generate, based on the website file, a visual representation file that represents a rendered appearance of the web page that is substantially identical to an appearance of the web page and to compress the visual representation file of the web page into a reduced image file, wherein the reduced image file represents a reduced-size rendered appearance of the web page for display in a size proportionally smaller than the web page, and wherein the rendering agent is further configured to limit a dynamic aspect of dynamic content in the website file to a static display, wherein the static display comprises an image from the dynamic content in the web page at a fixed time; a memory, configured to store the reduced image file and at least one network address associated with a network location of the website file, wherein the memory is further configured to cross reference the reduced image file with the at least one network address; and a first plurality of fetching agents and a second plurality of rendering agents, and wherein a ratio of the first plurality of fetching agents to the second plurality of rendering agents is modified during processing of website files to maintain a consumption of the memory within a range of fractions of a capacity of the memory.
26. The system of claim 25, wherein the fetching agent is further configured to utilize a plurality of concurrently active requests for web pages.
27. The system of claim 25, wherein the fetching agent is further configured to remove dynamic content from the website file.
28. The system of claim 25, wherein the web page includes nontext data.
29. The system of claim 25, further comprising: a search portal configured to: receive search criteria; and generate a report based on the search criteria, the report comprising the at least one network address and one of: a link to the reduced image file, and a copy of the reduced image file.
30. The system of claim 25, wherein the at least one network address comprises a uniform resource locator (URL) address of the web page.
31. The system of claim 25, wherein the static display comprises a frame of the dynamic content at the fixed time.
32. The system of claim 25, wherein the rendering agent is further configured to include an icon in the reduced image file indicating that the web page includes dynamic content.
33. The system of claim 32, wherein the icon indicates a type of the dynamic content.
34. A method comprising: receiving, at a computer, a website file via at least one network interface, wherein the website file is associated with a web page; generating, at the computer, a visual representation file that represents a rendered appearance of the web page that is based on the website file, wherein the generating limits a dynamic aspect of dynamic content in the website file to a static display, wherein the static display comprises an image from the dynamic content in the web page at a fixed time; compressing, at the computer, the visual representation file of the web page into a reduced image file, wherein the reduced image file represents a reduced-size rendered appearance of the web page; storing, at the computer, the reduced image file and at least one network address associated with a network location of the website file, wherein storing the reduced image file and the at least one network address comprises cross referencing the reduced image file with the at least one network address; and modifying a ratio of first plurality of fetching agents to second plurality of rendering agents during processing of website files to maintain a consumption of memory within a range of fractions of a capacity of the memory.
35. The method of claim 34, wherein the web page includes nontext data.
36. The method of claim 34, further comprising: receiving search criteria; and generating a report based on the search criteria, the report comprising the at least one network address and one of: a link to the reduced image file, and a copy of the reduced image file.
37. A non-transitory computer-readable storage medium having instructions stored thereon, the instructions comprising: instructions for receiving a website file via at least one network interface, wherein the website file is associated with a web page, instructions for generating, based on the website file, a visual representation file that represents a rendered appearance of the web page, wherein the generating limits a dynamic aspect of dynamic content in the website file to a static display, and wherein the static display comprises an image from the dynamic content in the web page at a fixed time; instructions for compressing the visual representation file of the web page into a reduced image file, wherein the reduced image file represents a reduced-size rendered appearance of the web page; instructions for storing the reduced image file and at least one network address associated with a network location of the website file, wherein the instructions for storing the reduced image file comprise instructions for cross referencing the reduced image file with the at least one network address; and instructions for modifying a ratio of first plurality of fetching agents to second plurality of rendering agents during processing of website files to maintain a consumption of memory within a range of fractions of a capacity of the memory.
38. The non-transitory computer-readable storage medium of claim 37, wherein the instructions for receiving, the instructions for generating, the instructions for compressing, and the instructions for storing are configured to operate on a single processor.
39. The non-transitory computer-readable storage medium of claim 37, wherein the instructions for receiving, the instructions for generating, the instructions for compressing, and the instructions for storing are configured to operate on multiple processors.
40. The non-transitory computer-readable storage medium of claim 39, wherein at least one of the multiple processors exclusively executes the instructions for receiving.
41. The non-transitory computer-readable storage medium of claim 37, wherein the web page includes nontext data.
42. The non-transitory computer-readable storage medium of claim 37, further comprising: instructions for receiving search criteria; and instructions for generating a report based on the search criteria, the report comprising the at least one network address and one of: a link to the reduced image file, and a copy of the reduced image file.
43. A non-transitory computer-readable storage medium having instructions stored thereon, the instructions comprising: instructions for receiving a file via at least one network interface, wherein the file includes formatting information; instructions for generating, based on the formatting information, a visual representation file that represents a rendered appearance of the file, wherein the generating limits a dynamic aspect of dynamic content in the file to a static display, wherein the static display comprises an image from the dynamic content in a web page at a fixed time; instructions for compressing the visual representation file into a reduced image file, wherein the reduced image file represents a reduced-size rendered appearance of the file; instructions for storing the reduced image file and at least one network address associated with a network location of the file, wherein the instructions for storing the reduced image file comprise instructions for cross referencing the reduced image file with the at least one network address; and instructions for modifying a ratio of first plurality of fetching agents to second plurality of rendering agents during processing of website files to maintain a consumption of memory within a range of fractions of a capacity of the memory.
44. The non-transitory computer-readable storage medium of claim 43, further comprising: instructions for receiving search criteria; and instructions for generating a report based on the search criteria, the report comprising the at least one network address and one of: a link to the reduced image file, and a copy of the reduced image file.